Handling Errors in Step Functions

Introduction

Unhandled errors and exceptions that occur during a state machine execution cause the entire execution to fail. The good news is that these errors can be handled.

AWS Step Functions provides two main mechanisms for handling these errors: Retry and Catch. In this article, we discuss the errors that can occur during a state machine execution and demonstrate how to handle them using these mechanisms.

Let’s dive in!

Where do these errors come from?

All states in a state machine definition (with the exception of 'Pass' and 'Wait' states) can encounter errors at runtime. 'Pass' and 'Wait' states are the exception because they do not run any custom operations. Runtime errors fall into the following categories:

  1. Task Failures: These occur when a task throws an exception during execution, for example, a DynamoDB 'PutItem' action that throws an 'InternalServerError' exception.
  2. State Machine Definition Errors: These occur when the definition itself cannot handle the input it receives, for example, when no rule in a 'Choice' state matches.
  3. Transient Errors: These are momentary errors that usually resolve themselves after a short period of time and are hard to reproduce. Common causes include a brief network or connectivity issue, compute units being recycled dynamically in the cloud, and additional connection latency introduced by proxies or servers.

These errors, regardless of their nature, can be handled.

For more information on states in a state machine configuration, check out the AWS documentation on Step Functions states.

Retry Mechanism

As the name suggests, this mechanism, when configured on a state, retries the state if it throws an exception at runtime. To configure it, add one or more retriers to the state's 'Retry' field. Each retrier must specify its retry conditions in 'ErrorEquals', an array of error names; the retrier applies when the error raised by the state matches one of those names.
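To make the matching rule concrete, here is a minimal sketch in plain JavaScript. It is not how Step Functions is implemented; the function name 'matchesErrorEquals' is hypothetical, and the sketch assumes only exact names, the 'States.ALL' wildcard, and the 'States.Runtime' special case. Other special cases, such as 'States.TaskFailed' and 'States.DataLimitExceeded', are deliberately left out.

```javascript
// Simplified model of ErrorEquals matching (an illustration, not the real engine).
// Assumption: only exact names, States.ALL, and States.Runtime are modeled.
function matchesErrorEquals(errorName, errorEquals) {
  // States.Runtime can never be retried or caught, so it never matches.
  if (errorName === "States.Runtime") return false;
  // An exact error name matches itself.
  if (errorEquals.includes(errorName)) return true;
  // States.ALL acts as a wildcard for the remaining error names.
  return errorEquals.includes("States.ALL");
}

console.log(matchesErrorEquals("Lambda.ServiceException", ["Lambda.ServiceException"])); // true
console.log(matchesErrorEquals("States.Timeout", ["Lambda.ServiceException"])); // false
console.log(matchesErrorEquals("States.Timeout", ["States.ALL"])); // true
console.log(matchesErrorEquals("States.Runtime", ["States.ALL"])); // false
```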

Additionally, each retrier can set the maximum number of retries with 'MaxAttempts' (default 3), the wait time in seconds before the first retry with 'IntervalSeconds' (default 1), and 'BackoffRate' (default 2.0), the multiplier by which the wait time increases after each retry.
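The wait before each retry follows from these three settings: the first retry waits 'IntervalSeconds', and each later wait is the previous one multiplied by 'BackoffRate'. The helper below is a small sketch of that arithmetic; the function name 'retryDelays' is hypothetical, and it ignores optional settings such as a maximum delay or jitter.

```javascript
// Computes the wait (in seconds) before each retry for one retrier.
// Assumption: wait before retry n is IntervalSeconds * BackoffRate^(n - 1).
function retryDelays({ intervalSeconds = 1, maxAttempts = 3, backoffRate = 2 } = {}) {
  const delays = [];
  for (let retry = 0; retry < maxAttempts; retry += 1) {
    delays.push(intervalSeconds * backoffRate ** retry);
  }
  return delays;
}

// IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2
console.log(retryDelays({ intervalSeconds: 2, maxAttempts: 3, backoffRate: 2 })); // [ 2, 4, 8 ]
```

With these values the state is retried after 2, 4, and 8 seconds, so a task that keeps failing gives up roughly 14 seconds after the first failure, plus the time each attempt takes to run.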

Let’s create a simple state machine definition with a single step. This step will be a Lambda function that throws a 'ServiceException' error regardless of the input.

We will build this solution using the Serverless Framework, the serverless-step-functions plugin, and @aws-sdk/smithy-client to access AWS error classes.

Base Setup

Run the commands below to clone the sample project that we will use to build our solution:

git clone https://github.com/NwekeChidi/serverless_labs.git
cd serverless_labs/handling_stepFunction_errors

Install the dependencies:

npm install

For the Lambda function, we are just going to create a function that fails woefully and throws the 'ServiceException' error.

Copy the code below into your 'hello-error.js' file:

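const { ServiceException } = require("@aws-sdk/smithy-client");

// Always fails: Step Functions sees the error under the name set below.
module.exports.handler = async (event, context) => {
	throw new ServiceException({
		name: "Lambda.ServiceException", // must match the name listed in ErrorEquals
		message: "An uninterruptible error",
		$fault: "client"
	});
};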

Copy the YAML configuration below into your 'serverless.yml' file:

service: handling-sf-errors
frameworkVersion: "3"

plugins:
  - serverless-step-functions
provider:
  name: aws
  region: us-west-2

functions:
  hello-error:
    handler: hello-error.handler
stepFunctions:
  stateMachines:
    error-SM:
      name: error-stateMachine
      definition:
        Comment: State Machine riddled with errors!
        StartAt: Hello Error
        States:
          Hello Error:
            Type: Task
            Resource:
              Fn::GetAtt: [hello-error, Arn]
            OutputPath: "$.Payload"
            Parameters:
              Payload.$: "$"
            Retry:
              - ErrorEquals:
                  - Lambda.ServiceException
                IntervalSeconds: 2
                MaxAttempts: 3
                BackoffRate: 2
            End: true

Deploy the solution:

sls deploy --verbose

Once your deployment is complete, navigate to the Step Functions console in your AWS account, select 'State machines' from the sidebar, and open 'error-stateMachine', which we just deployed.

Select 'Start execution' and run the process with the default placeholder payload.

You will notice that after the initial attempt fails, the state is retried until it reaches the configured maximum number of retry attempts, with the wait between retries growing exponentially (2, 4, and 8 seconds with the configuration above). Once the retries are exhausted, the execution fails.

Catch Mechanism

This is similar to a JavaScript 'try...catch' or a Python 'try...except' block. Each catcher in a state's 'Catch' field requires an 'ErrorEquals' list and a 'Next' field naming the fallback state, which is the state the execution transitions to when a matching error occurs. The fallback state can be any state in the machine, for example a task that performs some additional operations in response to the error.
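The analogy can be sketched in a few lines of JavaScript. This is only an illustration of the control flow, not the Step Functions engine: 'helloError' is a hypothetical stand-in for a failing task, 'nextStateAfter' is a hypothetical helper, and matching is reduced to exact error names.

```javascript
// Hypothetical stand-in for a task state that always fails.
function helloError() {
  const error = new Error("An uninterruptible error");
  error.name = "Lambda.ServiceException";
  throw error;
}

// Returns the state to run next: the matching catcher's Next on failure,
// or null when the task succeeds and the state ends the execution.
function nextStateAfter(task, catchers) {
  try {
    task();
    return null;
  } catch (error) {
    const catcher = catchers.find((c) => c.ErrorEquals.includes(error.name));
    if (!catcher) throw error; // no catcher matched: the execution fails
    return catcher.Next;
  }
}

const catchers = [{ ErrorEquals: ["Lambda.ServiceException"], Next: "SendToDLQ" }];
console.log(nextStateAfter(helloError, catchers)); // SendToDLQ
```

As in a 'try...catch' block, an error that no catcher matches is not swallowed; it propagates and fails the execution.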

To see this mechanism in practice, we are going to reuse the configuration from the Retry Mechanism example with a few modifications. We are going to add an additional step that sends the failed payload, as is, to an SQS queue.

Update your 'serverless.yml' with the configuration below and redeploy:

service: handling-sf-errors-catch
frameworkVersion: "3"

plugins:
  - serverless-step-functions
provider:
  name: aws
  region: us-west-2

functions:
  hello-error:
    handler: hello-error.handler

resources:
  Resources:
    MyQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: our-queue

stepFunctions:
  stateMachines:
    error-SM:
      name: error-catch-stateMachine
      definition:
        Comment: State Machine riddled with errors!
        StartAt: Hello Error
        States:
          Hello Error:
            Type: Task
            Resource:
              Fn::GetAtt: [hello-error, Arn]
            OutputPath: "$.Payload"
            Parameters:
              Payload.$: "$"
            Catch:
              - ErrorEquals:
                  - Lambda.ServiceException
                Next: SendToDLQ
                # null discards the error output so the original input is passed on
                ResultPath: null
            End: true
          SendToDLQ:
            Type: Task
            Resource: "arn:aws:states:::sqs:sendMessage"
            Parameters:
              MessageBody.$: "$"
              QueueUrl:
                Fn::GetAtt: [MyQueue, QueueUrl]
            End: true

Once your deployment is complete, select the 'error-catch-stateMachine' on your 'State machines' page and start an execution using the same default payload. You will notice that the 'SendToDLQ' state is run immediately after the 'Hello Error' state fails.

NB: The ASL error 'States.Runtime' cannot be retried or caught, so it will always cause the execution to fail. Learn more about Step Functions error names and their definitions in the AWS documentation.

Conclusion

In this article, we have seen how to successfully handle errors in our step functions using the Retry or the Catch mechanism.

The right mechanism to use in your state largely depends on your use case and the exception encountered. You can combine the two mechanisms if you want to handle the errors raised by your state differently. For example, you can have a Retry config that handles 'Lambda.ServiceException' errors and a Catch config that handles 'Lambda.AWSLambdaException' and 'States.IntrinsicFailure' errors. When both are configured for the same error, the retries run first, and the catcher applies only once the retries are exhausted.
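The interplay of the two mechanisms can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the real engine: 'runState' and 'flakyTask' are hypothetical names, matching is by exact error name only, and the waits between retries are skipped.

```javascript
// Simulates one state with both Retry and Catch configured.
function runState(task, { retry = [], catchers = [] }) {
  const retriesUsed = new Map(); // retrier -> number of retries consumed
  for (;;) {
    try {
      return { result: task() };
    } catch (error) {
      const retrier = retry.find((r) => r.ErrorEquals.includes(error.name));
      const used = retriesUsed.get(retrier) ?? 0;
      if (retrier && used < retrier.MaxAttempts) {
        retriesUsed.set(retrier, used + 1);
        continue; // the real service would wait here before retrying
      }
      // No retrier applies, or its retries are exhausted: fall back to Catch.
      const catcher = catchers.find((c) => c.ErrorEquals.includes(error.name));
      if (!catcher) throw error;
      return { next: catcher.Next };
    }
  }
}

// Hypothetical task: fails twice with a retryable error, then with a caught one.
let calls = 0;
function flakyTask() {
  calls += 1;
  const error = new Error("simulated failure");
  error.name = calls <= 2 ? "Lambda.ServiceException" : "Lambda.AWSLambdaException";
  throw error;
}

const outcome = runState(flakyTask, {
  retry: [{ ErrorEquals: ["Lambda.ServiceException"], MaxAttempts: 3 }],
  catchers: [{ ErrorEquals: ["Lambda.AWSLambdaException"], Next: "SendToDLQ" }],
});
console.log(calls, outcome); // 3 { next: 'SendToDLQ' }
```

The first two failures are absorbed by the retrier, and the third, which no retrier covers, is routed to the fallback state by the catcher.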

And as always, run the 'sls remove' command afterward to remove the stacks we deployed and avoid any unnecessary costs.

Sayonara🙂
