Introduction
In today's fast-paced world of cloud computing, where data flows ceaselessly between services and systems, ensuring the reliable delivery of messages has become paramount. Amazon Web Services (AWS) recognizes the importance of seamless messaging, which is why Amazon Simple Queue Service (SQS) has become a go-to choice for developers.
But what happens when messages encounter roadblocks at their destination, causing disruptions in the flow of information? The answer lies in SQS Dead-Letter Queues (DLQs).
In this article, we will see how to set up an AWS SQS queue with a Dead-Letter Queue in your Serverless Framework project using Infrastructure as Code (IaC), and understand the value of this feature for fault-tolerant systems.
This article covers:
- What is an SQS DLQ?
- When would you want to avoid an SQS DLQ?
- The main benefits of SQS DLQs.
- How to set up an SQS DLQ with the Serverless Framework.
- Bonus: Handling Batch Errors easily in Lambda with SQS
What is an SQS DLQ?
The Problem
Suppose you are developing or designing an architecture that involves queues, such as a high-volume write flow like the one below:
The "External Request" component that requests the SNS does not wait for the writing to finish. But, the SQS communication with Lambda is synchronous.
Consider the scenario where processing takes place after the message is delivered and the consuming component receives an unexpected data format - in the diagram above, that component would be the Lambda function - or simply hits a case that hasn't been covered by unit tests or even considered. When this happens, a bug can surface, and the processing in your Lambda function, or whatever you use to receive the message, may result in a failure.
In certain situations with Lambda, the failure might be momentary, like a network blip or something similar. For such cases, retry mechanisms are available, as discussed in another article written by Samuel Lock.
However, if it's something unexpected and persistent, or even intermittent, it becomes necessary to isolate these failed messages to analyze the cause as soon as possible. The development team can then implement a hotfix and, if applicable, devise a retry strategy for these isolated messages, but only after the hotfix has been applied.
The Solution
AWS services can act as building blocks, much like Lego bricks. In this case, we can connect an SQS queue to a DLQ without changing the code of the current workflow, effectively designating the DLQ as the repository for messages that have encountered processing failures.
The DLQ will route problem messages to the components subscribed to it. In the example from the diagram above, we can connect this DLQ to a Lambda function. Instead of performing the processing that resulted in the failure, the Lambda function can simply store the messages in a data store like DynamoDB. Compared to leaving messages in SQS DLQs, storing them in a data store provides more flexibility and options for handling complex error scenarios, enabling queries to understand the inputs that caused errors. This process may involve writing new unit test cases, fixing the error in the Lambda code that triggered the exception, and, if it aligns with your processing type, attempting manual or automatic retries of these messages.
Additionally, instead of using DynamoDB, you can connect it to any other component, such as saving the error-causing inputs in an S3 bucket as JSON files and querying them using Athena. Once you can isolate error messages in a DLQ, the possibilities for precisely how you can handle them are limitless.
When would you want to avoid an SQS DLQ?
There are two primary use cases where you may want to avoid using a DLQ:
- With standard queues when you want to be able to keep retrying the transmission of a message “indefinitely”.
- With a FIFO queue if you don't want to break the exact order of messages or operations.
Even in these two scenarios, you may still want to consider a DLQ. In general, it is most appropriate when:
- You need to troubleshoot incorrect message transmission.
- You aim to reduce the number of messages in your queue.
- You want to minimize the risk of exposing your system to poison-pill messages (messages that can be received but cannot be processed).
Unless you know very well why you don't want one, I still recommend using a DLQ so you don't risk losing your messages permanently.
The main benefits of DLQs
- Message Integrity. By isolating messages that have failed to process correctly, SQS DLQs help maintain data integrity.
- Enhanced Reliability. This feature ensures that no message is lost, even when unexpected errors occur in your system.
- Customized Error Handling. DLQs offer flexibility in handling failed messages. You can connect DLQs to various components, such as AWS Lambda, databases, or storage services like Amazon DynamoDB or Amazon S3, to implement customized error handling and analysis procedures. This flexibility allows you to choose the most suitable approach for your specific use case.
How to set up an SQS DLQ with the Serverless Framework
As a prerequisite, you must have the Serverless Framework and Node.js installed.
First and foremost, you need to initiate a new project. There are several templates created by the ServerlessGuru team on GitHub at this link. For instance, you can find the Webpack template, which helps you achieve a smaller bundle size. Another widely used option is the Serverless Framework's own templates, which can be used with a command such as: 'serverless create --template aws-nodejs-ecma-script --path <your-project-folder>'.
Once the project is created, proceed to define the necessary resource syntax. Follow these steps to configure a Dead Letter Queue (DLQ). I will use names of my choice and recommend using the same ones for learning purposes. After it's up and running, you can make changes as needed.
1. Define the Dead Letter Queue (DLQ): In your resources section, inside "resources: Resources:", you will define the Dead Letter Queue (DLQ) for your SQS queue as 'DeadLetterQueueSubscribeNews'. Just remember that the DLQ is where messages that couldn't be successfully processed will be sent.
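A minimal sketch of what that resource block might look like; the queue name and retention period here are illustrative choices, not values prescribed by this walkthrough:

```yaml
resources:
  Resources:
    DeadLetterQueueSubscribeNews:
      Type: AWS::SQS::Queue
      Properties:
        # Illustrative name; use whatever naming convention your project follows
        QueueName: dead-letter-queue-subscribe-news
        # Keep failed messages long enough to investigate them
        # (1209600 seconds = 14 days, the SQS maximum; adjust as needed)
        MessageRetentionPeriod: 1209600
```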
2. Configure the main SQS Queue to use the DLQ: In the 'SubscribeNewsQueue' resource definition, you will specify the 'RedrivePolicy', which configures the main SQS queue to send messages to the DLQ when they fail processing. Here's what you have:
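Since the original screenshot isn't reproduced here, this is a sketch of the 'SubscribeNewsQueue' definition using the values described in the list that follows; the queue name is an illustrative choice:

```yaml
    SubscribeNewsQueue:            # lives under resources -> Resources, next to the DLQ above
      Type: AWS::SQS::Queue
      Properties:
        QueueName: subscribe-news-queue
        MessageRetentionPeriod: 3600        # keep messages for up to 1 hour
        VisibilityTimeout: 30               # hide a message for 30s while it is being processed
        ReceiveMessageWaitTimeSeconds: 20   # enable long polling
        RedrivePolicy:
          deadLetterTargetArn:
            Fn::GetAtt: [DeadLetterQueueSubscribeNews, Arn]
          maxReceiveCount: 3                # move a message to the DLQ after 3 failed receives
```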
For our DLQ purposes, the most important of these keys is "RedrivePolicy", but you should know what each of them does:
- RedrivePolicy: This defines a dead-letter target queue named 'DeadLetterQueueSubscribeNews' and sets 'maxReceiveCount' to 3. Messages failing processing three times will be moved to the dead-letter queue for isolation and handling.
- MessageRetentionPeriod: This is set to 3600 seconds (1 hour). Messages will be stored in the queue for up to 1 hour before automatic deletion.
- VisibilityTimeout: It's configured for 30 seconds. Messages become "invisible" to other consumers for this duration after being picked up for processing.
- ReceiveMessageWaitTimeSeconds: This parameter is set to 20 seconds, enabling long-polling to reduce unnecessary API requests during message retrieval.
3. Set up your Lambda function to use the main SQS Queue: Your Lambda function, 'subscribe', will be configured to be triggered by the 'SubscribeNewsQueue' SQS queue through the 'events' section. Before checking the configuration below, it's worth mentioning that I'm using a widely adopted plugin called serverless-iam-roles-per-function to handle the IAM role inside the function configuration, so if you don't want to use a traditional IAM role statement, install this plugin before continuing:
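As the screenshot isn't reproduced here, this is a rough serverless.yml sketch of that step; the handler path, the DynamoDB permission, and the 'SubscribersTable' resource are assumptions for illustration only:

```yaml
functions:
  subscribe:
    handler: src/subscribe.handler          # assumed handler path
    # Function-level role provided by the serverless-iam-roles-per-function plugin
    iamRoleStatements:
      - Effect: Allow
        Action:
          - dynamodb:PutItem                # whatever your processing actually needs
        Resource:
          Fn::GetAtt: [SubscribersTable, Arn]   # hypothetical DynamoDB table
    events:
      - sqs:
          arn:
            Fn::GetAtt: [SubscribeNewsQueue, Arn]
          batchSize: 5                      # process up to 5 messages per invocation
```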
This means that your Lambda function will consume messages from the 'SubscribeNewsQueue'.
4. Set up ANOTHER Lambda function to use the DLQ: Nothing special here; it's almost the same configuration as the previous Lambda, but it references the DLQ instead:
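A sketch of the DLQ consumer; the function name and handler path are my own illustrative choices:

```yaml
  handleSubscribeNewsDlq:
    handler: src/handleSubscribeNewsDlq.handler
    events:
      - sqs:
          arn:
            Fn::GetAtt: [DeadLetterQueueSubscribeNews, Arn]
          batchSize: 1          # process failed messages one at a time for easier analysis
```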
After these steps, you will have a working DLQ. If you have any doubts about how to handle the messages, I put together a fully functional example in this repository.
Handling Batch Errors Easily in Lambda with SQS
You may have noticed the batch configuration in the Lambda event; let's look at this topic more closely. By default, when you use batching in AWS Lambda with SQS as a trigger, if one message in a batch fails, the entire batch is retried, including the messages that were processed successfully. This default behavior can be inefficient, especially for large batches.
Not so long ago, the challenge was handling errors effectively without marking the entire batch as successful. Serverless Framework version 2.67.0 introduced the 'functionResponseType' option with the value 'ReportBatchItemFailures' to address this.
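In serverless.yml, the option sits next to the SQS event definition. A sketch based on the 'subscribe' function shown earlier:

```yaml
    events:
      - sqs:
          arn:
            Fn::GetAtt: [SubscribeNewsQueue, Arn]
          batchSize: 5
          functionResponseType: ReportBatchItemFailures
```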
When 'functionResponseType' is set to 'ReportBatchItemFailures':
- Only the specific failed message is retried.
- Successfully processed messages in the batch are not retried.
For example, suppose you have a batch of 5 messages and 3 of them fail while 2 are processed successfully. When the retry occurs, only those 3 failed messages will be sent again in a batch, not the initial 5.
The only thing you need to do after setting 'functionResponseType' to 'ReportBatchItemFailures' is to change your code to return a list containing the IDs of the messages that failed, like the example below:
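Since the screenshot isn't included here, this is a minimal TypeScript sketch of that return shape; 'processSubscription' is a hypothetical stand-in for your actual business logic:

```typescript
import type { SQSEvent, SQSBatchResponse, SQSBatchItemFailure } from 'aws-lambda';

// Placeholder for your real processing logic (e.g. writing to DynamoDB)
const processSubscription = async (payload: unknown): Promise<void> => {
  // ...
};

export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  const batchItemFailures: SQSBatchItemFailure[] = [];

  for (const record of event.Records) {
    try {
      await processSubscription(JSON.parse(record.body));
    } catch (error) {
      // Report only this message as failed; successful ones won't be retried
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }

  // Lambda retries only the messages listed here
  return { batchItemFailures };
};
```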
To help you take in the complete flow, I added another 2 code examples that deal with the IDs of the failed messages:
- Batch Item Failures Try Catch Example - a traditional for loop around a try-catch, pushing the IDs of failed messages into an array. Less memory usage.
- Batch Item Failures Promise.allSettled Example - a modern functional approach using "Promise.allSettled", "Array.map" and "Array.reduce" to build a new array of failed message IDs. A little more memory usage. A rough sketch follows below.
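Here is a sketch of that second approach, reusing the same types and the hypothetical 'processSubscription' from the example above; it is an alternative body for the same handler, not an additional function:

```typescript
import type { SQSEvent, SQSBatchResponse, SQSBatchItemFailure } from 'aws-lambda';

export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  // Process all records in parallel and collect the outcome of each one
  const results = await Promise.allSettled(
    event.Records.map((record) => processSubscription(JSON.parse(record.body)))
  );

  // Keep only the message IDs whose processing was rejected
  const batchItemFailures = results.reduce<SQSBatchItemFailure[]>(
    (failures, result, index) => {
      if (result.status === 'rejected') {
        failures.push({ itemIdentifier: event.Records[index].messageId });
      }
      return failures;
    },
    []
  );

  return { batchItemFailures };
};
```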
If you want to use another approach with predefined wrappers and to configure 'functionResponseType' through CloudFormation, check out Lambda Powertools. It has a utility to help with this type of batch processing.
This setting improves efficiency and accuracy in handling errors during batch processing.
Conclusion
In this article, you learned how to prevent message loss when unexpected situations occur in your system.
AWS SQS Dead Letter Queues are vital components for ensuring the reliability and fault tolerance of your cloud-based applications. They enable easier troubleshooting of messages and enhance the integrity of your system.
Adding this new skill to your toolkit when building serverless apps will empower you to build robust systems that can withstand the unpredictability of the digital landscape.
If you have any questions about this topic, you can reach me by opening an issue on my GitHub or by contacting Serverless Guru on social media (Twitter) (Linkedin).