Why use an Event Driven Architecture?
The main reason to use an Event Driven Architecture is to separate execution into loosely coupled pieces. One of the first use cases that comes to mind is connecting two systems from different enterprises, where you don’t have control over the technology, compute power or memory of the other system.
Extending this thought, one realizes that it also applies to a single-stack system. By separating different aspects of an application into smaller services, you can assign the right amount of compute power and the right technology to each portion of your overall process.
This becomes even more true with cloud computing, where you have access to many different technologies and managed services. You can allocate the right amount of resources and optimize your costs by separating the pieces.
Another benefit of using events to drive your application is that some parts can be processed asynchronously, which reduces the latency perceived by your end users, lowers the infrastructure needs and therefore reduces the cost.
Disassociate concerns
By disassociating concerns, each part of an application can scale independently, use a different technology and have a different resource allocation.
All the steps are bound together by events. Each part only knows about the event triggering it and the resulting event.
Example application
Synchronous GET Endpoint
This endpoint receives most of the traffic. To ensure a smooth user experience, the Lambda function is configured with reserved concurrency; to reduce cost, the memory allocation is kept to the minimum of 128MB and the maximum execution time to 2 seconds.
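As a rough illustration (not the article’s actual infrastructure code), an AWS CDK sketch in Python of such a function could look like this; the handler name, asset path and concurrency value are placeholders:

```python
from aws_cdk import Stack, Duration, aws_lambda as _lambda
from constructs import Construct

class GetEndpointStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Low memory and a short timeout keep the cost of the busiest endpoint down.
        _lambda.Function(
            self, "GetHandler",
            runtime=_lambda.Runtime.PYTHON_3_12,
            handler="get_handler.handler",         # placeholder module/function
            code=_lambda.Code.from_asset("src"),   # placeholder asset path
            memory_size=128,
            timeout=Duration.seconds(2),
            reserved_concurrent_executions=50,     # example value, tune to your traffic
        )
```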
Asynchronous POST endpoint
The payload sent to this endpoint needs additional processing before it can be stored:
- External API call
- Second POST from the application with additional data
The external API can be slow or may need throttling, and the second call from the application can come several minutes later.
To avoid blocking the user’s application, the first POST is processed asynchronously. API Gateway acknowledges the reception of the payload, sends a response to the client and initiates a Step Functions workflow.
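The article has API Gateway kick off the workflow directly; to make the flow explicit, here is a hedged sketch of the equivalent behaviour in a thin Lambda proxy, with the state machine ARN taken from a placeholder environment variable:

```python
import json
import os
import uuid

import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]  # assumed environment variable

def handler(event, context):
    payload = json.loads(event["body"])

    # Start the workflow and return immediately: the client is not kept
    # waiting for the slow external API call or the second POST.
    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        name=str(uuid.uuid4()),          # execution names must be unique
        input=json.dumps(payload),
    )

    # 202 Accepted: the payload was received, processing continues asynchronously.
    return {"statusCode": 202, "body": json.dumps({"status": "accepted"})}
```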
Another POST with additional data
This second POST is synchronous and contains additional data needed to store our final object. Nothing special happens inside this function; we allocate 128MB to it and a maximum execution time of 10 seconds.
Step Functions Workflow
The first step calls an external API. The payload returned by the API can be a very big array that we need to parse and pick information from. We allocate 2048MB to this function and a max execution time of 120 seconds.
The second step waits for the additional data to be received.
In the third step, the information is written to the database, ready to be consumed by the GET endpoint.
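One way to implement the waiting step is the Step Functions task-token callback pattern. The sketch below assumes (this is not spelled out in the article) that the waiting step handed its task token to a function that stored it in a DynamoDB table keyed by a requestId present in both POSTs:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")
table = boto3.resource("dynamodb").Table("PendingRequests")  # assumed table name

def second_post_handler(event, context):
    body = json.loads(event["body"])

    # Look up the task token that the waiting workflow step stored for this request.
    item = table.get_item(Key={"requestId": body["requestId"]})["Item"]

    # Resume the workflow: the waiting step completes with the additional data.
    sfn.send_task_success(
        taskToken=item["taskToken"],
        output=json.dumps(body["additionalData"]),
    )

    return {"statusCode": 200, "body": json.dumps({"status": "received"})}
```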
DynamoDB Stream
Any change in the database is captured by a stream and sent on to Kinesis Firehose. This happens asynchronously and doesn’t affect the Step Functions workflow.
Kinesis Firehose
Kinesis Firehose batches the incoming changes and writes them to S3 in bulk. This reduces the number of writes to S3 and the number of reads needed by Athena or Glue, improving both latency and cost.
We don’t need the data stored in S3 in real time, so we configure Firehose to write a file every 5 minutes or every 128MB, whichever comes first.
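These thresholds correspond to Firehose buffering hints. A hedged boto3 sketch, with stream name, bucket and role ARNs as placeholders:

```python
import boto3

firehose = boto3.client("firehose")

# Buffer records until 128MB have accumulated or 5 minutes have passed,
# whichever comes first, then write a single object to S3.
firehose.create_delivery_stream(
    DeliveryStreamName="ddb-changes-to-s3",                      # placeholder name
    DeliveryStreamType="KinesisStreamAsSource",                  # fed by the change stream
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:...:stream/ddb-changes",  # placeholder ARN
        "RoleARN": "arn:aws:iam::...:role/firehose-source-role",       # placeholder ARN
    },
    ExtendedS3DestinationConfiguration={
        "BucketARN": "arn:aws:s3:::my-analytics-bucket",         # placeholder bucket
        "RoleARN": "arn:aws:iam::...:role/firehose-delivery-role",
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
    },
)
```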
Athena
Athena is invoked on every new file written to S3. The data is queried, processed and written to another bucket for other applications to consume.
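A common way to wire this up is an S3 event notification invoking a small Lambda function that starts the Athena query; the query, database and output bucket below are placeholders:

```python
import boto3

athena = boto3.client("athena")

def handler(event, context):
    # Triggered by an S3 "ObjectCreated" notification for each new Firehose file.
    for record in event["Records"]:
        print("New file landed:", record["s3"]["object"]["key"])

    # Start the query asynchronously; Athena writes its results to another bucket.
    athena.start_query_execution(
        QueryString="SELECT ... FROM changes",               # placeholder query
        QueryExecutionContext={"Database": "analytics_db"},  # placeholder database
        ResultConfiguration={
            "OutputLocation": "s3://processed-data-bucket/"  # placeholder bucket
        },
    )
```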
Glue
Glue is invoked through EventBridge Scheduler to query and process the data daily and import it into Redshift.
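One possible wiring, assuming an EventBridge Scheduler universal target calling Glue’s StartJobRun directly (schedule name, role and job name are placeholders):

```python
import json
import boto3

scheduler = boto3.client("scheduler")

# Run the Glue job once a day via an EventBridge Scheduler universal target.
scheduler.create_schedule(
    Name="daily-glue-import",                                   # placeholder name
    ScheduleExpression="cron(0 3 * * ? *)",                     # every day at 03:00 UTC
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "Arn": "arn:aws:scheduler:::aws-sdk:glue:startJobRun",  # universal target for Glue StartJobRun
        "RoleArn": "arn:aws:iam::...:role/scheduler-glue-role", # placeholder role
        "Input": json.dumps({"JobName": "daily-redshift-import"}),  # placeholder job name
    },
)
```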
Redshift
Redshift serves as the source for Quicksight. New data written into Redshift triggers an event to refresh Quicksight.
Quicksight
Quicksight uses its own SPICE storage to display the data. The SPICE storage is refreshed after a successful import into Redshift.
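One way to trigger that refresh is the QuickSight CreateIngestion API; the account ID and dataset ID below are placeholders:

```python
import uuid
import boto3

quicksight = boto3.client("quicksight")

def refresh_spice(event, context):
    # Called after the Redshift import succeeds; starts a new SPICE ingestion
    # so dashboards pick up the fresh data.
    quicksight.create_ingestion(
        AwsAccountId="123456789012",      # placeholder account ID
        DataSetId="sales-dataset",        # placeholder dataset ID
        IngestionId=str(uuid.uuid4()),    # each refresh needs a unique ID
    )
```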
Pitfalls and added complexity
Idempotency
Events are not necessarily unique. The producer can by mistake create the same event twice, or the consumer can process the same event multiple times.
Causes for duplication can be:
- A user hitting twice a submit button
- A process behind an SQS queue has a timeout and the event is sent back to the queue
- General transmission issue
- A service issue
An application must therefore be able to handle duplication. This can be done by using a unique identifier and storing it in persistent storage. Before processing an event, we verify that it hasn’t already been processed.
Some examples of idempotency tokens are:
- Generated by the client
- Generated by the event service (SQS, SNS, …)
- Hash of payload generated by the receiver
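A minimal sketch of such a check, using a DynamoDB conditional write as the persistent store (table name, key and helper names are assumptions):

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("ProcessedEvents")  # assumed table name

def process_once(event_id: str, payload: dict) -> None:
    try:
        # The conditional write only succeeds the first time this event ID is seen.
        table.put_item(
            Item={"eventId": event_id},
            ConditionExpression="attribute_not_exists(eventId)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate: the event was already processed, skip it
        raise

    handle(payload)  # hypothetical business logic

def handle(payload: dict) -> None:
    ...
```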
Some AWS services have idempotency built in, like SNS FIFO or SQS FIFO. These services achieve idempotency by refusing a new message from a client with the same identifier. Even with these services, the client needs to generate a unique ID per message prior to submitting it.
It also doesn’t prevent a message from going back to an SQS queue when an error or timeout happens in the consumer.
Using a service with built-in idempotency doesn’t remove the need to account for duplicates in the consumer.
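For reference, a hedged example of publishing to an SQS FIFO queue with a client-generated deduplication ID (queue URL and payload fields are placeholders):

```python
import json
import uuid
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/orders.fifo"  # placeholder URL

def publish(order: dict) -> None:
    # The client generates the deduplication ID; SQS FIFO silently drops any
    # message sent with the same ID within the 5-minute deduplication window.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(order),
        MessageGroupId=order["customerId"],        # ordering scope (assumed field)
        MessageDeduplicationId=order.get("requestId", str(uuid.uuid4())),
    )
```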
Traceability
Each building block of an Event Driven Application publishes logs for its own processes. Tracing a request from start to end becomes a challenge.
To ease this burden, the logs should be collected by a single service. This can simply be CloudWatch Logs or any third-party tool.
A request flow should have a unique identifier or correlationId passed to each subtask. This identifier can be created by the initial sender or by the first service initiating the flow.
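A minimal sketch of propagating such a correlationId, assuming it travels inside the event payload and is forwarded on an EventBridge bus (field names, source and detail type are placeholders):

```python
import json
import logging
import uuid

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)
events = boto3.client("events")

def handler(event, context):
    # Reuse the caller's correlation ID, or mint one if this is the first service in the flow.
    correlation_id = event.get("correlationId", str(uuid.uuid4()))
    logger.info("processing correlationId=%s", correlation_id)

    # Pass the same ID along so the whole flow can be queried in the aggregated logs.
    events.put_events(Entries=[{
        "Source": "app.orders",                  # placeholder source
        "DetailType": "OrderValidated",          # placeholder detail type
        "Detail": json.dumps({"correlationId": correlation_id, "data": event.get("data")}),
    }])
```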
AWS X-Ray is a service that allows you to correlate traces across your different services and also gives you insights into the AWS services used, like connection times to DynamoDB.
With CloudWatch Logs Insights, you can query aggregated logs from your multiple services.
Data sources
With independent events, an important point to consider is how the data needed by a subtask is transferred by the producer.
- All the data from the producer is available in the event payload
  - This reduces the number of requests to a database or any other kind of storage, but the event can become very heavy, often exceeding the size limit imposed by the broker.
  - This solution becomes an issue when multiple services are chained, each consumer adding more data to the payload for the next consumer.
  - This kind of data transfer is useful when the amount of data to be transferred is reasonable and the data needed by the consumers is known by the producer.
- Send minimal information and provide endpoints for consumers to get more data (see the sketch after this list)
  - This reduces the event payload size, but it requires the producer to provide endpoints that allow the consumer to fetch the needed data, adding strain on the underlying storage or databases.
  - It is the burden of the producer to provide the right level of caching to improve performance.
- Cached by the consumer
  - A consumer can cache data from the producer, either by syncing the data on a schedule, or by keeping it “hot” for a defined time after the first request.
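As referenced above, here is a hedged sketch of the “minimal information” approach: the producer publishes only an identifier, and the consumer fetches the full record from a hypothetical endpoint exposed by the producer.

```python
import json
import urllib.request

import boto3

events = boto3.client("events")

def producer(order_id: str) -> None:
    # Publish only a reference to the data, keeping the event payload small.
    events.put_events(Entries=[{
        "Source": "app.orders",                    # placeholder source
        "DetailType": "OrderCreated",              # placeholder detail type
        "Detail": json.dumps({"orderId": order_id}),
    }])

def consumer(event, context):
    # Fetch the full record from an endpoint exposed by the producer.
    detail = event["detail"]
    url = f"https://api.example.com/orders/{detail['orderId']}"  # hypothetical endpoint
    with urllib.request.urlopen(url) as response:
        order = json.loads(response.read())
    # ... process the full order here
```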