Handling Errors in Step Functions


Introduction

Unhandled errors and exceptions that occur during a state machine execution cause the entire execution to fail. The good news is that these errors can be handled.

AWS Step Functions provides two main mechanisms for handling these errors: Retry and Catch. In this article, we are going to discuss the errors that can be encountered during a state machine execution and demonstrate how to handle them using these mechanisms.

Let’s dive in!

Where do these errors come from?

All states in a state machine definition, with the exception of 'Pass' and 'Wait' states, can encounter errors during runtime. 'Pass' and 'Wait' states are excluded because they do not run any custom operations. Errors encountered during runtime fall into the following categories:

Task Failures: These occur when a task throws an exception during execution, for example, a DynamoDB 'PutItem' action that throws an 'InternalServerError' exception.
State Machine Definition Errors: These occur when the definition cannot handle the input it receives, for example, when no rule in a 'Choice' state matches.
Transient Errors: These are momentary errors that usually resolve themselves after a short period of time and are rarely reproducible. Leading causes include a momentary network or connectivity issue, issues while dynamically recycling computing units on the cloud, and additional connection latency introduced by proxy resources, servers, etc.

These errors, regardless of their nature, can be handled.

For more information on 'States' in a state machine configuration, check out this AWS Documentation on Step Function States

Retry Mechanism

As the name suggests, this mechanism, when configured on a state, retries the state if it throws an exception during runtime. To configure it, you must specify the retry condition(s) in the 'ErrorEquals' field, which is a list of the error names that should trigger a retry.

Additionally, you can configure the maximum number of retries with 'MaxAttempts' (default 3), the wait time in seconds before the first retry with 'IntervalSeconds' (default 1), and an exponential backoff with 'BackoffRate' (default 2.0), which is the multiplier by which the wait time increases after each retry attempt.
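To make the backoff arithmetic concrete, here is a minimal sketch that computes the wait before each retry. The helper name 'retryDelays' is made up for illustration, and the sketch assumes the wait before retry n is IntervalSeconds multiplied by BackoffRate raised to (n - 1), ignoring any delay cap or jitter.

```javascript
// Hypothetical helper (not part of any AWS SDK): lists the wait, in seconds,
// before each retry attempt for a given Retry configuration.
// Assumption: wait before retry n (1-based) = IntervalSeconds * BackoffRate^(n - 1).
function retryDelays({ IntervalSeconds = 1, MaxAttempts = 3, BackoffRate = 2 } = {}) {
  const delays = [];
  for (let attempt = 1; attempt <= MaxAttempts; attempt++) {
    delays.push(IntervalSeconds * Math.pow(BackoffRate, attempt - 1));
  }
  return delays;
}

// The values used later in this article: 2 second interval, 3 attempts, rate of 2.
console.log(retryDelays({ IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2 })); // [ 2, 4, 8 ]
```

With these settings the state waits 2, 4, and then 8 seconds, so a permanently failing task takes at least 14 seconds of waiting before the retries are exhausted.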

Let’s create a simple state machine definition with a single step. This step will be a Lambda function that throws a 'ServiceException' error regardless of the input.

We will be building this solution using the Serverless Framework, the serverless-step-functions plugin, and @aws-sdk/smithy-client to access AWS error classes.

Base Setup

Run the commands below to set up our node project and install the dependencies that we will use to build our solution.

git clone https://github.com/NwekeChidi/serverless_labs.git
cd serverless_labs/handling_stepFunction_errors

Install the dependencies:

npm install

For the Lambda function, we are just going to create a function that fails woefully and throws the 'ServiceException' error.

Copy the code below into your 'hello-error.js' file:

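const { ServiceException } = require("@aws-sdk/smithy-client");

// Always fails, so that the state's error handling is exercised on every run.
module.exports.handler = async (event, context) => {
  // The name set here is the one listed under ErrorEquals in serverless.yml.
  throw new ServiceException({
    name: "Lambda.ServiceException",
    message: "An uninterruptible error",
    $fault: "client",
  });
};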

Copy the YAML configuration below into your 'serverless.yml' file:

service: handling-sf-errors
frameworkVersion: "3"

plugins:
  - serverless-step-functions

provider:
  name: aws
  region: us-west-2

functions:
  hello-error:
    handler: hello-error.handler

stepFunctions:
  stateMachines:
    error-SM:
      name: error-stateMachine
      definition:
        Comment: State Machine riddled with errors!
        StartAt: Hello Error
        States:
          Hello Error:
            Type: Task
            Resource:
              Fn::GetAtt: [hello-error, Arn]
            OutputPath: "$.Payload"
            Parameters:
              Payload.$: "$"
            Retry:
              - ErrorEquals:
                  - Lambda.ServiceException
                IntervalSeconds: 2
                MaxAttempts: 3
                BackoffRate: 2
            End: true

Deploy the solution:

sls deploy --verbose

Once your deployment is complete, navigate to your 'Step Functions' Console on your AWS account, select State machines from the sidebar, and select 'error-stateMachine' which we just deployed.

Select 'Start execution' and run the process with the default placeholder payload.

You will notice that after the initial attempt fails, the state is retried until it reaches the configured maximum number of retry attempts, with the wait between retries growing exponentially. Because no Catch is configured, the execution then fails.

Catch Mechanism

This is similar to a JavaScript 'try...catch' or a Python 'try...except' block. Each catcher requires a 'Next' field naming a fallback state, which is the state the execution transitions to when a matching error is raised. The fallback state can be a resource that performs some additional operations in response to the error.

To see this mechanism in practice, we are going to reuse the configuration from the Retry Mechanism example with a few modifications: we will add a step that sends the failed payload as is to an SQS queue.
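The analogy with a JavaScript try/catch can be sketched in plain code. This is only an illustration of the control flow, not how Step Functions is implemented: 'helloError', 'sendToDLQ', and 'runWithCatch' are hypothetical stand-ins, and the fallback receives the original input, roughly what a catcher with 'ResultPath' set to null does.

```javascript
// Stand-in for the failing task (hypothetical; no AWS SDK needed here).
async function helloError(input) {
  const err = new Error("An uninterruptible error");
  err.name = "Lambda.ServiceException";
  throw err;
}

// Stand-in for the fallback state that the catcher points to.
async function sendToDLQ(input) {
  return { sentToQueue: true, body: input };
}

// Rough analogue of a Task state with a single catcher.
async function runWithCatch(task, catcher, input) {
  try {
    return await task(input);
  } catch (err) {
    if (catcher.ErrorEquals.includes(err.name)) {
      // Matching error: hand the original input to the fallback.
      return catcher.next(input);
    }
    throw err; // No matching catcher: the error propagates and the run fails.
  }
}

runWithCatch(
  helloError,
  { ErrorEquals: ["Lambda.ServiceException"], next: sendToDLQ },
  { hello: "world" }
).then((result) => console.log(result));
```

An error whose name is not listed in the catcher is rethrown, which mirrors an execution failing when no catcher matches.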

Update your 'serverless.yml' with the configuration below and redeploy:

service: handling-sf-errors-catch
frameworkVersion: "3"

plugins:
  - serverless-step-functions

provider:
  name: aws
  region: us-west-2

functions:
  hello-error:
    handler: hello-error.handler

resources:
  Resources:
    MyQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: our-queue

stepFunctions:
  stateMachines:
    error-SM:
      name: error-catch-stateMachine
      definition:
        Comment: State Machine riddled with errors!
        StartAt: Hello Error
        States:
          Hello Error:
            Type: Task
            Resource:
              Fn::GetAtt: [hello-error, Arn]
            OutputPath: "$.Payload"
            Parameters:
              Payload.$: "$"
            Catch:
              - ErrorEquals:
                  - Lambda.ServiceException
                Next: SendToDLQ
                # null discards the error output and passes the original input on
                ResultPath: null
            End: true
          SendToDLQ:
            Type: Task
            Resource: "arn:aws:states:::sqs:sendMessage"
            Parameters:
              MessageBody.$: "$"
              QueueUrl:
                # Ref on an SQS queue resolves to its URL
                Ref: MyQueue
            End: true

Once your deployment is complete, select the 'error-catch-stateMachine' on your 'State machines' page and start an execution using the same default payload. You will notice that the 'SendToDLQ' state is run immediately after the 'Hello Error' state fails.

NB: The ASL error 'States.Runtime' is not retriable, is not matched by 'States.ALL', and will always cause the execution to fail. Learn more about Step Function Error Names and their definitions here.

Conclusion

In this article, we have seen how to successfully handle errors in our step functions using the Retry or the Catch mechanism.

The right mechanism to use in your state is largely dependent on your use case and the exception encountered. You can combine these two mechanisms if you want to handle the errors that might be raised by your state differently. For example, you can have a Retry config that handles 'Lambda.ServiceException' errors and a Catch config that handles 'Lambda.AWSLambdaException' and 'States.IntrinsicFailure' errors.

And as always, use the 'sls remove' command to remove the stacks we deployed afterward and avoid any costs.
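When both mechanisms are configured on one state, retriers are consulted first, and catchers only come into play once no retrier applies or its attempts are exhausted. The sketch below models that decision for the example just described. 'resolveError' is a hypothetical helper, and it is simplified to exact name matches, leaving out wildcards such as 'States.ALL'.

```javascript
// Hypothetical state mirroring the combination described above.
const state = {
  Retry: [{ ErrorEquals: ["Lambda.ServiceException"], MaxAttempts: 3 }],
  Catch: [
    {
      ErrorEquals: ["Lambda.AWSLambdaException", "States.IntrinsicFailure"],
      Next: "SendToDLQ",
    },
  ],
};

// Decides what happens to an error, given how many retries were already made.
// Simplification: error names must match exactly (no wildcard handling).
function resolveError(state, errorName, retriesSoFar) {
  const retrier = (state.Retry || []).find((r) => r.ErrorEquals.includes(errorName));
  if (retrier && retriesSoFar < (retrier.MaxAttempts ?? 3)) {
    return { action: "retry" };
  }
  const catcher = (state.Catch || []).find((c) => c.ErrorEquals.includes(errorName));
  if (catcher) {
    return { action: "catch", next: catcher.Next };
  }
  return { action: "fail" };
}

console.log(resolveError(state, "Lambda.ServiceException", 0).action);   // retry
console.log(resolveError(state, "Lambda.ServiceException", 3).action);   // fail
console.log(resolveError(state, "Lambda.AWSLambdaException", 0).action); // catch
```

Note the second call: once the retries for 'Lambda.ServiceException' run out, no catcher lists that error, so the execution fails. Adding the same error name to a catcher would route it to the fallback state instead.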

Sayonara🙂

References

https://docs.aws.amazon.com/step-functions/latest/dg/concepts-error-handling.html#error-handling-examples

Nweke Chidimma Dominic Gabriel

Sr. Serverless Developer

Dedicated to crafting cutting-edge, intelligent, & scalable solutions with a passion for solving complex puzzles and classical music.
