Complex Serverless Exceptions with Step Functions

Introduction

In the realm of serverless architecture, a critical aspect to highlight is the resilience and fault tolerance of our services. Ensuring that these services can gracefully handle failures is crucial, but it often requires significant effort. Given the inherent limitations of AWS Lambda, such as execution timeouts and resource constraints, handling complex exceptions solely within Lambda functions can be challenging. Therefore, in this article, we will explore an effective approach to address this issue using AWS Step Functions (SFN).

The goal of this article is to provide an in-depth look at how AWS Step Functions can be used to manage complex exceptions in serverless architectures, ensuring robust and resilient workflows.

How can I use Step Functions to handle long-running exceptions?

Let's consider a simple scenario where we have a Lambda function (contract-lambda) responsible for saving new contracts in a database. This function is triggered by another service that responds to the user asynchronously with a request ID, so if an error occurs the user is not notified right away. Before saving a contract, contract-lambda must validate whether the user is eligible for it. The user data required for this validation comes from another Lambda function (user-lambda), which reads user information from an updated database. This updated database is populated via AWS Database Migration Service (DMS) from a legacy database that is still in use. Leaving the migration part of this architecture aside, we have the following representation:

A lambda called contract-lambda invoking a second one called user-lambda.
"Simple serverless architecture example"

Here’s where the challenge arises: if a new user exists only in the legacy database and the DMS process has not yet migrated their data to the non-legacy database, user-lambda will not find the user information needed for validation. This missing information creates a significant problem, as contract-lambda cannot proceed without it. This is where the Error Handler SFN comes in.

A lambda called contract-lambda invoking a second one called user-lambda. Both access DynamoDB. When needed, they call the Step Functions error handler.
"Simple serverless architecture example calling the Step Functions error handler"

Now, let’s define our Step Functions state machine to handle errors that are not easily handled within Lambda. The first thing I recommend is to define a payload for the state machine. Feel free to use this model:

{
	"origin": "String",
	"target": "String",
	"exception": "String",
	"fallbackTarget": "String",
	"statePayload": {},
	"maxAttempts": 0
}
  • origin: String (Required)
    • Description: Indicates the source of the error.
    • Example: "contract-lambda" or "user-lambda"
  • exception: String (Required)
    • Description: Exception (Error) that caused the entire problem.
    • Example: "MyCustomException"
  • target: String (Optional)
    • Description: Indicates the bypass flow to handle the exception.
    • Example: "ALERT_DEV_TEAM"
  • fallbackTarget: String (Required if target is not set)
    • Description: The alternative flow for exceptions that the primary flow could not handle. Accepts the same values as the target field.
    • Example: "ALERT_DEV_TEAM"
  • statePayload: Object (Required if the origin is set)
    • Description: Used to retry the origin execution with the necessary state information.
    • Example: { "body": {"userId": "12345", "contractId": "67890"} }
  • maxAttempts: Number (Required)
    • Description: Indicates how many times you want to retry the exception flow.
    • Example: 3
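
The field rules above can be checked before an execution is started. The following is a minimal sketch, not the article's code: the function name `validateErrorPayload` and the error messages are hypothetical, and it assumes fallbackTarget is required only when target is absent.

```javascript
// Hypothetical validator for the error-handler payload described above.
// The function name and messages are illustrative assumptions.
function validateErrorPayload(payload) {
  const errors = [];
  const isNonEmptyString = (v) => typeof v === "string" && v.length > 0;

  if (!isNonEmptyString(payload.origin)) errors.push("origin is required");
  if (!isNonEmptyString(payload.exception)) errors.push("exception is required");

  // Assumption: fallbackTarget is mandatory only when no target is given.
  if (!isNonEmptyString(payload.target) && !isNonEmptyString(payload.fallbackTarget)) {
    errors.push("fallbackTarget is required when target is not set");
  }

  // statePayload carries what is needed to re-run the origin.
  const sp = payload.statePayload;
  if (sp === null || typeof sp !== "object" || Array.isArray(sp)) {
    errors.push("statePayload must be an object");
  }

  if (!Number.isInteger(payload.maxAttempts) || payload.maxAttempts < 0) {
    errors.push("maxAttempts must be a non-negative integer");
  }

  // An empty array means the payload passed every check.
  return errors;
}
```

Returning a list of problems instead of throwing on the first one makes it easier to log everything that is wrong with a payload in a single pass.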

Now, considering these fields, we can create the “logic model” of our SFN:

"The logic model of the error handler state machine"

Briefly, reading the diagram above from the top, the first step validates our bypass field (target). If target is not set in the SFN payload, the state machine tries to recover before contacting the responsible team. If it is set, the state machine goes straight to that flow, for example an alert message describing what the problem was, when it happened, and where.
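
One way to read that first branch is as a plain function. This is only an interpretation of the diagram: the step name `RETRY_ORIGIN` and the `attemptsSoFar` counter are hypothetical and do not come from the article.

```javascript
// Hypothetical model of the first decision in the logic model.
// "RETRY_ORIGIN" and attemptsSoFar are assumptions made for illustration.
function decideNextStep(payload, attemptsSoFar) {
  // A target means the caller wants to bypass recovery and go to that flow.
  if (payload.target) return payload.target;

  // No target: try to recover by re-running the origin while attempts remain.
  if (attemptsSoFar < payload.maxAttempts) return "RETRY_ORIGIN";

  // Recovery exhausted: hand over to the fallback flow.
  return payload.fallbackTarget;
}
```

For instance, a payload without target and with maxAttempts set to 3 keeps returning `RETRY_ORIGIN` for attempts 0 to 2 and then yields the fallbackTarget.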

Now, applying this idea to our business rule, we can see the “physical model”:

The real physical state machine as drawn in the AWS Step Functions visual editor
"Error flow physical model"

The idea here is that when an error of type NonLegacyUserNotFoundException occurs, the state machine calls a Lambda that checks whether the DMS tasks have failed and retries fetching the data (if that is actually the problem). If the problem cannot be solved and the fallbackTarget field is set to ALERT_DEV_TEAM, an email is sent via SNS warning that there is a problem the common flow could not solve.
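
A state machine along these lines could be described in Amazon States Language. The sketch below writes it as a JavaScript object so it can be serialized with `JSON.stringify`. It is not the article's actual definition: the state names and both ARNs are placeholders, only the ALERT_DEV_TEAM flow is modeled, and the retry count is hard-coded because an ASL `Retry` block takes a literal `MaxAttempts` rather than a value from the payload.

```javascript
// Sketch of an Amazon States Language definition as a JS object.
// State names and ARNs are placeholders, not taken from the article.
const RECOVERY_LAMBDA_ARN =
  "arn:aws:lambda:us-east-1:123456789012:function:check-dms-and-retry";
const ALERT_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:dev-team-alerts";

const errorHandlerDefinition = {
  StartAt: "HasTarget",
  States: {
    HasTarget: {
      Type: "Choice",
      Choices: [
        // A target in the payload bypasses recovery.
        { Variable: "$.target", IsPresent: true, Next: "AlertDevTeam" },
        {
          Variable: "$.exception",
          StringEquals: "NonLegacyUserNotFoundException",
          Next: "CheckDmsAndRetry",
        },
      ],
      Default: "AlertDevTeam",
    },
    CheckDmsAndRetry: {
      Type: "Task",
      Resource: RECOVERY_LAMBDA_ARN,
      Retry: [
        { ErrorEquals: ["States.ALL"], IntervalSeconds: 30, MaxAttempts: 3, BackoffRate: 2 },
      ],
      // When retries are exhausted, keep the input and fall back to the alert.
      Catch: [
        { ErrorEquals: ["States.ALL"], ResultPath: "$.recoveryError", Next: "AlertDevTeam" },
      ],
      End: true,
    },
    AlertDevTeam: {
      Type: "Task",
      Resource: "arn:aws:states:::sns:publish",
      Parameters: { TopicArn: ALERT_TOPIC_ARN, "Message.$": "States.JsonToString($)" },
      End: true,
    },
  },
};

// Sanity check: every transition must point at a state that is defined.
function findDanglingTransitions(definition) {
  const names = new Set(Object.keys(definition.States));
  const targets = [definition.StartAt];
  for (const state of Object.values(definition.States)) {
    if (state.Next) targets.push(state.Next);
    if (state.Default) targets.push(state.Default);
    for (const rule of [...(state.Choices || []), ...(state.Catch || [])]) {
      targets.push(rule.Next);
    }
  }
  return targets.filter((t) => !names.has(t));
}
```

The small `findDanglingTransitions` helper is a convenience for catching typos in state names before deploying; it does not replace the validation that Step Functions performs itself.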

Pros and Cons of Using AWS Step Functions for Complex Exception Handling

Before you start to create any kind of infrastructure, I strongly recommend you consider the following pros and cons of this approach:

Pros

  • Automatic Retries: Step Functions can automatically retry failed tasks, reducing manual intervention.
  • Fallback Mechanisms: Allows defining fallback paths to handle exceptions gracefully.
  • State Management: Step Functions maintain the state of the workflow, enabling recovery from the exact point of failure.
  • Error Handling Logic: Centralized error handling logic improves maintainability and readability.
  • Seamless Scaling: Step Functions scale automatically with the number of tasks, providing consistent performance under load.
  • Decoupling: Decouples error handling from business logic, making the architecture more modular and easier to manage.
  • AWS Ecosystem: Easy integration with other AWS services like SNS, Lambda, and DynamoDB for a comprehensive serverless solution.
  • Execution History: Provides detailed execution history and visual workflow monitoring, aiding in debugging and performance tuning.

Cons

  • Learning Curve: Introducing Step Functions adds complexity to the architecture and requires learning a new service.
  • Increased Development Time: Implementing and managing Step Functions might increase development and maintenance efforts.
  • Cost: Step Functions Standard workflows are billed per state transition, so measure whether what you are building is overengineering or not.
  • Increased Latency: The orchestration of multiple steps can introduce latency, which might not be suitable for all real-time applications.
  • Execution Limits: AWS Step Functions have limits on execution history retention and state transitions, which could impact long-running workflows.

Applying the Idea as Code

Considering these ideas, we can provide the infrastructure of this pattern using the Serverless Framework. Keep in mind that we are going to mock the data.

Creating the Serverless Framework Project

I’ve created the project using the sls CLI. You don’t need to create this with the HTTP template.

Serverless Framework init project
"Initializing the project with sls CLI"

Applying the JS and serverless.yaml Code

  • First, I created the two functions following the single-responsibility principle (SRP).
Serverless Framework lambda definitions
"Defining our Lambdas in serverless.yaml file"
  • And the infrastructure:
Serverless Framework IaC definitions
"Defining IaC in serverless.yaml file"

As mentioned, the functions use mock data, so we are not actually accessing any data sources.

  • Create contract lambda script:
"Script of our create-contract-lambda"
  • Get user data Lambda script:
"Script of our get-user-data"
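
Since the original scripts were shown as screenshots, here is a rough sketch of what two mock-backed functions like these might look like. The function names, the `MOCK_USERS` object, and the response shapes are assumptions. To keep the example self-contained, the user lookup is passed in as a plain function instead of being a real Lambda invocation.

```javascript
// Mock "database": no real data source is accessed.
const MOCK_USERS = { "12345": { userId: "12345", eligible: true } };

// get-user-data: returns the user, or throws when it has not been migrated yet.
async function getUserData(event) {
  const user = MOCK_USERS[event.userId];
  if (!user) {
    const error = new Error(`User ${event.userId} not found in the non-legacy database`);
    // The name is what an error handler would later use to pick a flow.
    error.name = "NonLegacyUserNotFoundException";
    throw error;
  }
  return user;
}

// create-contract: validates eligibility, then pretends to save the contract.
// In a deployed setup, fetchUser would invoke the other Lambda instead.
async function createContract(event, fetchUser = getUserData) {
  const { userId, contractId } = event.body;
  const user = await fetchUser({ userId });
  if (!user.eligible) {
    return { statusCode: 403, body: "User is not eligible" };
  }
  return { statusCode: 201, body: JSON.stringify({ contractId, userId }) };
}
```

Calling `createContract` with a userId that is missing from `MOCK_USERS` rejects with an error named NonLegacyUserNotFoundException, which is the case the error handler is meant to pick up.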

To easily adapt my code to the Step Functions handler, I created a package (which could be exposed through a Lambda layer) where I define some rules.

"Custom serverless exception handler"

As you can see, I defined a strategy to build the payload of my Step Functions state machine execution with the data that I need (the function payload to retry, the origin, etc.). I only call Step Functions if the error thrown is not the same as the previous one, because otherwise the state machine can get stuck in a loop. So be careful about how you define this rule.
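
The strategy described here could be sketched as a wrapper around a Lambda handler. Everything named below is hypothetical: `withErrorHandler`, `buildErrorPayload`, and the `previousException` field used as the loop guard are illustrative choices, not the article's package. `startExecution` is injected so the example runs without AWS; in a real project it could wrap `StartExecutionCommand` from `@aws-sdk/client-sfn`.

```javascript
// Builds the state machine input from the failing event and error.
function buildErrorPayload({ origin, error, event, fallbackTarget, maxAttempts }) {
  return {
    origin,
    exception: error.name,
    fallbackTarget,
    // Remember which exception triggered the handler so a retry can spot a repeat.
    statePayload: { ...event, previousException: error.name },
    maxAttempts,
  };
}

// Wraps a handler; on failure it starts the error-handler execution once.
function withErrorHandler(handler, options) {
  const { origin, startExecution, fallbackTarget = "ALERT_DEV_TEAM", maxAttempts = 3 } = options;
  return async (event) => {
    try {
      return await handler(event);
    } catch (error) {
      // Loop guard: the same exception on a retried event is not sent again.
      if (event.previousException !== error.name) {
        await startExecution(
          buildErrorPayload({ origin, error, event, fallbackTarget, maxAttempts })
        );
      }
      throw error;
    }
  };
}
```

When the state machine re-invokes the origin with statePayload and the same exception is thrown again, the guard skips the second `startExecution` call, which is what keeps the retry from looping.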

Finally, you can define your own custom error types to decide whether you are going to call Step Functions or not, e.g.:

Custom Exceptions
"Our custom exceptions"

It’s also very important to define the name attribute on each exception, so that Step Functions receives it and can decide what to do with that data.
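
A minimal sketch of such exception types follows. The base class `StepFunctionsHandledError` and the helper `shouldCallStepFunctions` are hypothetical names. Setting `name` explicitly matters because a subclass of `Error` otherwise inherits the default name "Error".

```javascript
// Hypothetical base class: anything extending it is routed to Step Functions.
class StepFunctionsHandledError extends Error {
  constructor(message) {
    super(message);
    this.name = "StepFunctionsHandledError";
  }
}

class NonLegacyUserNotFoundException extends StepFunctionsHandledError {
  constructor(userId) {
    super(`User ${userId} not found in the non-legacy database`);
    // Explicit name, so the state machine can match on it.
    this.name = "NonLegacyUserNotFoundException";
  }
}

// Plain errors (validation failures, bugs) stay inside the Lambda.
function shouldCallStepFunctions(error) {
  return error instanceof StepFunctionsHandledError;
}
```

Writing the name as a string literal, rather than deriving it from the class, keeps it stable even if the code is bundled or minified.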

Conclusion

In this article, we demonstrated how AWS Step Functions can enhance the resilience and fault tolerance of Serverless Architectures by effectively managing complex exceptions. By addressing a scenario involving data synchronization between legacy and non-legacy databases, we illustrated how Step Functions can handle errors that AWS Lambda alone cannot manage due to its inherent limitations.

We detailed a payload structure and the logical and physical models for implementing this error-handling mechanism. This approach includes retry mechanisms and fallback actions, such as alerting the development team via SNS if issues persist.

Implementing this pattern using the Serverless Framework, as shown, ensures your serverless applications are more robust and capable of gracefully recovering from failures. Leveraging AWS Step Functions for complex error handling improves the reliability of your workflows and enhances the overall stability of your services.
