Introduction
Unhandled errors and exceptions that occur during a state machine execution will cause the entire process to fail. The good news is that these errors can be handled.
AWS Step Functions provides two main mechanisms for this: Retry and Catch. In this article, we will discuss the errors that can be encountered during a state machine execution and demonstrate how to handle them using these mechanisms.
Let’s dive in!
Where do these errors come from?
All states in a state machine definition, with the exception of the 'Pass' and 'Wait' states, can encounter errors during runtime. 'Pass' and 'Wait' states do not encounter errors because they do not perform any custom operations. Errors encountered at runtime typically fall into the following categories:
- Task Failures: This occurs when a task throws an exception during execution, for example, a DynamoDB 'PutItem' action that throws an 'InternalServerError' Exception.
- State Machine Definition Errors: These occur when the definition itself misbehaves at runtime, for example, a 'Choice' state whose input matches no rule and which has no 'Default' fallback configured.
- Transient Errors: These are momentary errors that usually resolve themselves after a short period of time. They are almost non-reproducible. Leading causes can be a momentary network or connectivity issue, issues while dynamically recycling computing units on the cloud, additional connection latency introduced by proxy resources, servers, etc.
These errors, regardless of their nature can be handled.
For more information on 'States' in a state machine configuration, check out this AWS Documentation on Step Function States
Retry Mechanism
As the name suggests, this mechanism, when configured on a state, retries that state if it throws an exception at runtime. To configure it, you specify one or more retry conditions in the 'ErrorEquals' parameter, which is an array of error names or exceptions that should trigger a retry.
Additionally, you can configure a maximum number of retries ('MaxAttempts'), a wait time in seconds before the first retry ('IntervalSeconds'), and a 'BackoffRate': a multiplier applied to the interval after each attempt, so that the wait time grows exponentially between retries.
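To make the interplay of these three parameters concrete, here is a small sketch (a hypothetical helper, not part of any SDK) that computes the wait time before each retry the way the Retry configuration describes:

```javascript
// Sketch: derive each retry delay from a Retry configuration.
// retryDelays is an illustrative helper, not a Step Functions API.
function retryDelays({ intervalSeconds, backoffRate, maxAttempts }) {
  const delays = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // The first retry waits IntervalSeconds; each subsequent retry
    // waits BackoffRate times longer than the previous one.
    delays.push(intervalSeconds * Math.pow(backoffRate, attempt));
  }
  return delays;
}

console.log(retryDelays({ intervalSeconds: 2, backoffRate: 2, maxAttempts: 3 }));
// logs [ 2, 4, 8 ]
```

With the values we use below (IntervalSeconds: 2, BackoffRate: 2, MaxAttempts: 3), the execution waits 2, 4, and 8 seconds before the three retries.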
Let’s create a simple state machine definition with a single step. This step will be a lambda function that throws a 'ServiceException' error regardless of the input.
We will be building this solution using the Serverless Framework, the serverless-step-functions plugin, and @aws-sdk/smithy-client to access AWS error classes.
Base Setup
Run the commands below to set up our node project and install the dependencies that we will use to build our solution.
git clone https://github.com/NwekeChidi/serverless_labs.git
cd serverless_labs/handling_stepFunction_errors
Install the dependencies:
npm install
For the Lambda function, we are just going to create a function that fails woefully and throws the 'ServiceException' error.
Copy the code below into your 'hello-error.js' file:
const { ServiceException } = require("@aws-sdk/smithy-client");

module.exports.handler = async (event, context) => {
  throw new ServiceException({
    name: "Lambda.ServiceException",
    message: "An uninterruptible error",
    $fault: "client",
  });
};
Copy the YAML configuration below into your 'serverless.yml' file:
service: handling-sf-errors
frameworkVersion: "3"

plugins:
  - serverless-step-functions

provider:
  name: aws
  region: us-west-2

functions:
  hello-error:
    handler: hello-error.handler

stepFunctions:
  stateMachines:
    error-SM:
      name: error-stateMachine
      definition:
        Comment: State Machine riddled with errors!
        StartAt: Hello Error
        States:
          Hello Error:
            Type: Task
            Resource:
              Fn::GetAtt: [hello-error, Arn]
            OutputPath: "$.Payload"
            Parameters:
              Payload.$: "$"
            Retry:
              - ErrorEquals:
                  - Lambda.ServiceException
                IntervalSeconds: 2
                MaxAttempts: 3
                BackoffRate: 2
            End: true
Deploy the solution:
sls deploy --verbose
Once your deployment is complete, navigate to your 'Step Functions' Console on your AWS account, select State machines from the sidebar, and select 'error-stateMachine' which we just deployed.
Select 'Start execution' and run the process with the default placeholder payload.
You will notice the retry mechanism kick in after the initial execution fails; the state is retried until it reaches the configured maximum number of attempts, observing an exponential backoff between retries.
Catch Mechanism
This is similar to a JavaScript 'try/catch' or a Python 'try/except' block. The mechanism requires a non-optional fallback state, specified in the 'Next' field of the Catch configuration, which the execution falls back to in the event of an exception. This state can be a resource that performs additional operations in response to the error.
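The try/catch analogy can be made explicit with a short sketch; the function names below are illustrative stand-ins for the states we define next, not real Step Functions APIs:

```javascript
// Analogy only: an ASL Catch behaves like this try/catch, where the state
// named in "Next" plays the role of the catch handler.
async function runTask() {
  // Simulates the 'Hello Error' Lambda, which always fails.
  throw new Error("Lambda.ServiceException");
}

async function sendToDLQ(payload) {
  // Stand-in for a fallback state such as 'SendToDLQ'.
  return { sentToDLQ: true, payload };
}

async function executeWithFallback(payload) {
  try {
    return await runTask(payload);
  } catch (err) {
    // Instead of failing the whole execution, control transfers to the
    // fallback state named in the Catch config's "Next" field.
    return sendToDLQ({ input: payload, cause: err.message });
  }
}
```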
To see this mechanism in practice, we are going to reuse the same configuration from the Retry mechanism example with a few modifications. We are going to add an additional step that sends the failed payload, as is, to an SQS queue.
Update your 'serverless.yml' with the configuration below and redeploy:
service: handling-sf-errors-catch
frameworkVersion: "3"

plugins:
  - serverless-step-functions

provider:
  name: aws
  region: us-west-2

functions:
  hello-error:
    handler: hello-error.handler

resources:
  Resources:
    MyQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: our-queue

stepFunctions:
  stateMachines:
    error-SM:
      name: error-catch-stateMachine
      definition:
        Comment: State Machine riddled with errors!
        StartAt: Hello Error
        States:
          Hello Error:
            Type: Task
            Resource:
              Fn::GetAtt: [hello-error, Arn]
            OutputPath: "$.Payload"
            Parameters:
              Payload.$: "$"
            Catch:
              - ErrorEquals:
                  - Lambda.ServiceException
                Next: SendToDLQ
                ResultPath: null
            End: true
          SendToDLQ:
            Type: Task
            Resource: "arn:aws:states:::sqs:sendMessage"
            Parameters:
              MessageBody.$: "$"
              QueueUrl:
                Fn::GetAtt: [MyQueue, QueueUrl]
            End: true
Once your deployment is complete, select the 'error-catch-stateMachine' on your 'State machines' page and start an execution using the same default payload. You will notice that the 'SendToDLQ' state is run immediately after the 'Hello Error' state fails.
NB: The ASL error 'States.Runtime' will always cause the execution to fail; it cannot be retried or caught. Learn more about Step Functions error names and their definitions here.
Conclusion
In this article, we have seen how to successfully handle errors in our Step Functions state machines using the Retry and Catch mechanisms.
The right mechanism to use in your state is largely dependent on your use case and the exception encountered. You can combine these two mechanisms if you want to handle errors that might be raised by your state differently. For example, you can have a Retry config that handles 'ServiceException' errors and a Catch config that handles 'AWSLambdaException' and 'States.IntrinsicFailure' errors.
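In the serverless.yml syntax used above, such a combination might look like the sketch below (the error names and the 'SendToDLQ' state are illustrative); the Catch fires only when an error is not handled by a matching retrier, or once that retrier's attempts are exhausted:

```yaml
Hello Error:
  Type: Task
  Resource:
    Fn::GetAtt: [hello-error, Arn]
  Retry:
    - ErrorEquals:
        - Lambda.ServiceException
      IntervalSeconds: 2
      MaxAttempts: 3
      BackoffRate: 2
  Catch:
    - ErrorEquals:
        - Lambda.AWSLambdaException
        - States.IntrinsicFailure
      Next: SendToDLQ
  End: true
```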
And as always, remove the stacks we deployed afterward using the 'sls remove' command to avoid incurring any costs.
Sayonara🙂