EDA: Benefits and Pitfalls

July 5, 2024

Why use an Event-Driven Architecture?

The main reason to use an Event-Driven Architecture is to decouple execution between loosely coupled components. One of the first use cases that comes to mind is connecting two systems from different enterprises, where you have no control over the technology, compute power, or memory of the other system.

Extending this thought, one realizes that the same applies to a single-stack system. By separating the different aspects of an application into smaller services, you can assign the right amount of compute power and the right technology to each portion of your overall process.

This becomes even more relevant with cloud computing, where you have access to many different technologies and managed services. By separating the pieces, you can allocate the right amount of resources to each and optimize your costs.

Another benefit of using events to drive your application is that some parts can run asynchronously. This reduces the latency perceived by your end users, trims the infrastructure needed, and therefore reduces cost.

Disassociate concerns

By disassociating concerns, each part of an application can scale independently, use a different technology, and have its own resource allocation.

All the steps are bound together by events. Each part only knows about the event that triggers it and the event it emits.
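To make this concrete, here is a small TypeScript sketch of such an event contract; all the names are invented for illustration:

```typescript
// Illustrative event contracts: the service's whole interface is the
// incoming event it handles and the event it emits.
interface PaymentRequested {
  type: 'PaymentRequested';
  correlationId: string;
  amountCents: number;
}

interface PaymentProcessed {
  type: 'PaymentProcessed';
  correlationId: string;
  success: boolean;
}

// The service needs no knowledge of who produced the triggering event
// or who will consume its result.
export async function handlePayment(event: PaymentRequested): Promise<PaymentProcessed> {
  const success = event.amountCents > 0; // placeholder business logic
  return { type: 'PaymentProcessed', correlationId: event.correlationId, success };
}
```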

Example application

A fictional example showcases the use of events to transfer data between the various services of a single stack. Each service has its own technology, storage, compute, and memory.

Synchronous GET Endpoint

This endpoint receives most of the traffic. To ensure a smooth user experience, the Lambda function is configured with reserved concurrency; the memory allocation is kept at the minimum of 128 MB, with a maximum execution time of 2 seconds to reduce cost.
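As an illustration, a minimal AWS CDK sketch of this configuration could look like the following; the handler path and the reserved concurrency value are assumptions:

```typescript
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

export class GetEndpointStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Hot read path: minimal memory, short timeout, and reserved
    // concurrency so other functions in the account cannot starve it.
    new lambda.Function(this, 'GetEndpointFn', {
      runtime: lambda.Runtime.NODEJS_20_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda/get-endpoint'), // placeholder path
      memorySize: 128,                   // minimum allocation, lowest cost
      timeout: Duration.seconds(2),      // fail fast to cap per-request cost
      reservedConcurrentExecutions: 100, // illustrative value
    });
  }
}
```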

Asynchronous POST Endpoint

The payload sent to this endpoint needs additional processing before it can be stored:

  • An external API call
  • A second POST from the application with additional data

The external API can be slow or may need throttling, and the second call from the application can arrive several minutes later.

To avoid blocking the user’s application, the first POST is executed asynchronously: API Gateway acknowledges receipt of the payload, sends a response to the client, and initiates a Step Functions workflow.
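One way to wire this up, sketched here in CDK under the assumption that the state machine is defined elsewhere in the stack (see the workflow section below), is a direct service integration from API Gateway to states:StartExecution:

```typescript
import * as apigw from 'aws-cdk-lib/aws-apigateway';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import { Construct } from 'constructs';

// Assumed to exist elsewhere in the stack.
declare const scope: Construct;
declare const stateMachine: sfn.IStateMachine; // the workflow defined below

const role = new iam.Role(scope, 'ApiStartExecutionRole', {
  assumedBy: new iam.ServicePrincipal('apigateway.amazonaws.com'),
});
stateMachine.grantStartExecution(role);

// The POST method maps straight onto states:StartExecution, so API
// Gateway responds as soon as the execution is accepted.
const api = new apigw.RestApi(scope, 'IngestApi');
api.root.addMethod('POST', new apigw.AwsIntegration({
  service: 'states',
  action: 'StartExecution',
  options: {
    credentialsRole: role,
    requestTemplates: {
      'application/json': JSON.stringify({
        stateMachineArn: stateMachine.stateMachineArn,
        input: '$util.escapeJavaScript($input.body)',
      }),
    },
    integrationResponses: [{ statusCode: '200' }],
  },
}), { methodResponses: [{ statusCode: '200' }] });
```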

Another POST with additional data

This second POST is synchronous and contains the additional data needed to store our final object. Nothing special happens inside this function; we allocate 128 MB to it and a maximum execution time of 10 seconds.

Step Functions Workflow

The first step calls an external API. The payload returned by the API can be a very large array that we need to parse to pick information from. We allocate 2048 MB to this step, with a maximum execution time of 120 seconds.

The second step waits for the additional data to be received.

In the third step, the information is written to the database, ready to be consumed by the GET endpoint.
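A possible shape for this workflow in CDK; the handler functions and table are assumptions, and the wait is modeled with Step Functions’ task-token callback pattern, one common way to express “wait for a second request”:

```typescript
import { Duration } from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import { Construct } from 'constructs';

// Assumed to exist elsewhere in the stack.
declare const scope: Construct;
declare const externalApiFn: lambda.IFunction; // the 2048 MB / 120 s function
declare const storeTokenFn: lambda.IFunction;  // persists the task token
declare const table: dynamodb.ITable;

// Step 1: call the external API and pick information out of its response.
const callExternalApi = new tasks.LambdaInvoke(scope, 'CallExternalApi', {
  lambdaFunction: externalApiFn,
  outputPath: '$.Payload',
});

// Step 2: pause until the second POST arrives; its handler resumes the
// execution by calling SendTaskSuccess with the stored task token.
const waitForAdditionalData = new tasks.LambdaInvoke(scope, 'WaitForAdditionalData', {
  lambdaFunction: storeTokenFn,
  integrationPattern: sfn.IntegrationPattern.WAIT_FOR_TASK_TOKEN,
  payload: sfn.TaskInput.fromObject({
    token: sfn.JsonPath.taskToken,
    'partial.$': '$',
  }),
  taskTimeout: sfn.Timeout.duration(Duration.hours(1)), // don't wait forever
});

// Step 3: write the merged object, ready for the GET endpoint to read.
const writeToDb = new tasks.DynamoPutItem(scope, 'WriteToDb', {
  table,
  item: {
    pk: tasks.DynamoAttributeValue.fromString(sfn.JsonPath.stringAt('$.id')),
    payload: tasks.DynamoAttributeValue.fromString(sfn.JsonPath.stringAt('$.merged')),
  },
});

new sfn.StateMachine(scope, 'IngestFlow', {
  definitionBody: sfn.DefinitionBody.fromChainable(
    callExternalApi.next(waitForAdditionalData).next(writeToDb),
  ),
});
```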

DynamoDB Stream

Any change in the database is captured by a stream and sent to Kinesis Firehose. This happens asynchronously and doesn’t affect the Step Functions workflow.
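One way to set this up in CDK, assuming DynamoDB’s change-data capture publishes to a Kinesis data stream that Firehose then reads from (names are illustrative):

```typescript
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as kinesis from 'aws-cdk-lib/aws-kinesis';
import { Construct } from 'constructs';

declare const scope: Construct; // the enclosing Stack

// Item-level changes are published to the Kinesis stream without any
// involvement from the workflow that wrote the item.
const changesStream = new kinesis.Stream(scope, 'ChangesStream');

new dynamodb.Table(scope, 'ObjectsTable', {
  partitionKey: { name: 'pk', type: dynamodb.AttributeType.STRING },
  kinesisStream: changesStream,
});
```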

Kinesis Firehose

Kinesis Firehose batches the incoming changes and writes them to S3 in bulk. This reduces the number of writes to S3 and the number of reads needed by Athena or Glue, improving both latency and cost.

We don’t need the data stored in S3 in real time, so we configure Firehose to write a file every 5 minutes or every 128 MB, whichever comes first.
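A sketch of the corresponding buffering configuration, using the L1 CfnDeliveryStream construct and assuming the stream, bucket, and delivery role exist elsewhere in the stack:

```typescript
import * as iam from 'aws-cdk-lib/aws-iam';
import * as kinesis from 'aws-cdk-lib/aws-kinesis';
import * as firehose from 'aws-cdk-lib/aws-kinesisfirehose';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// Assumed to exist elsewhere in the stack.
declare const scope: Construct;
declare const changesStream: kinesis.Stream;
declare const landingBucket: s3.Bucket;
declare const firehoseRole: iam.Role; // may read the stream and write to S3

new firehose.CfnDeliveryStream(scope, 'ChangesToS3', {
  deliveryStreamType: 'KinesisStreamAsSource',
  kinesisStreamSourceConfiguration: {
    kinesisStreamArn: changesStream.streamArn,
    roleArn: firehoseRole.roleArn,
  },
  s3DestinationConfiguration: {
    bucketArn: landingBucket.bucketArn,
    roleArn: firehoseRole.roleArn,
    bufferingHints: {
      intervalInSeconds: 300, // 5 minutes...
      sizeInMBs: 128,         // ...or 128 MB, whichever comes first
    },
  },
});
```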

Athena

Athena is invoked for every new file written to S3. The data is queried, processed, and written to another bucket for other applications to consume.
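The article doesn’t specify the trigger mechanism; one common approach is an S3 event notification invoking a Lambda function that starts the Athena query. A sketch, with the query, workgroup, and output bucket as placeholders:

```typescript
import { AthenaClient, StartQueryExecutionCommand } from '@aws-sdk/client-athena';
import { S3Event } from 'aws-lambda';

const athena = new AthenaClient({});

// Hypothetical handler wired to the bucket's ObjectCreated notifications.
export const handler = async (event: S3Event): Promise<void> => {
  for (const record of event.Records) {
    console.log(`new file: ${record.s3.object.key}`);
    await athena.send(new StartQueryExecutionCommand({
      // Placeholder query; a real one would process the newly landed data.
      QueryString: 'SELECT * FROM raw_events LIMIT 10',
      WorkGroup: 'primary',
      ResultConfiguration: {
        OutputLocation: 's3://processed-bucket/athena-results/', // placeholder
      },
    }));
  }
};
```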

Glue

Glue is invoked through EventBridge Scheduler to query and process the data daily and import it into Redshift.
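A sketch of such a schedule using EventBridge Scheduler’s universal target to call glue:StartJobRun; the job name, cron expression, and role are assumptions:

```typescript
import * as scheduler from 'aws-cdk-lib/aws-scheduler';
import * as iam from 'aws-cdk-lib/aws-iam';
import { Construct } from 'constructs';

// Assumed to exist elsewhere in the stack.
declare const scope: Construct;
declare const schedulerRole: iam.Role; // allowed to call glue:StartJobRun

new scheduler.CfnSchedule(scope, 'DailyGlueRun', {
  scheduleExpression: 'cron(0 3 * * ? *)', // every day at 03:00 UTC
  flexibleTimeWindow: { mode: 'OFF' },
  target: {
    arn: 'arn:aws:scheduler:::aws-sdk:glue:startJobRun',
    roleArn: schedulerRole.roleArn,
    input: JSON.stringify({ JobName: 'daily-redshift-import' }), // placeholder job
  },
});
```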

Redshift

Redshift serves as the source for QuickSight. New data written into Redshift triggers an event to refresh QuickSight.

QuickSight

QuickSight uses its own SPICE storage to display the data. The SPICE storage is refreshed after a successful import into Redshift.
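One way to implement the refresh (an assumption, since QuickSight offers several refresh mechanisms) is a small function, triggered after the import, that starts a SPICE ingestion via the QuickSight API:

```typescript
import { QuickSightClient, CreateIngestionCommand } from '@aws-sdk/client-quicksight';
import { randomUUID } from 'node:crypto';

const quicksight = new QuickSightClient({});

// Hypothetical handler, triggered after a successful Redshift import,
// that refreshes the SPICE dataset backing the dashboards.
export const handler = async (): Promise<void> => {
  await quicksight.send(new CreateIngestionCommand({
    AwsAccountId: process.env.AWS_ACCOUNT_ID!, // assumed environment variable
    DataSetId: process.env.DATASET_ID!,        // assumed environment variable
    IngestionId: randomUUID(),                 // each refresh needs a unique id
  }));
};
```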

Pitfalls and added complexity

Idempotency

Events are not guaranteed to be unique. The producer can mistakenly create the same event twice, or the consumer can process the same event multiple times.

Causes of duplication include:

  • A user hitting a submit button twice
  • A process behind an SQS queue timing out, causing the event to be returned to the queue
  • A general transmission issue
  • A service issue

An application must therefore be able to handle duplication. This can be done by using a unique identifier and storing it in persistent storage: before processing an event, we verify that it hasn’t already been processed.
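A minimal sketch of this check using a DynamoDB conditional write; the table name and TTL are assumptions:

```typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Record the event id with a conditional write; the write fails if the
// id was already stored, i.e. the event was already processed. Note the
// trade-off: marking before processing means a crash mid-processing
// drops the event, so the marker may need to be rolled back on failure.
export async function processOnce(eventId: string, work: () => Promise<void>): Promise<void> {
  try {
    await ddb.send(new PutCommand({
      TableName: 'processed-events', // placeholder table
      Item: {
        pk: eventId,
        expiresAt: Math.floor(Date.now() / 1000) + 24 * 3600, // TTL cleanup
      },
      ConditionExpression: 'attribute_not_exists(pk)',
    }));
  } catch (err: any) {
    if (err.name === 'ConditionalCheckFailedException') return; // duplicate: skip
    throw err;
  }
  await work(); // first time we see this id: do the real work
}
```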

Some examples of idempotency tokens are:

  • Generated by the client
  • Generated by the event service (SQS, SNS, …)
  • Hash of payload generated by the receiver

Some AWS services have idempotency built in, like SNS FIFO or SQS FIFO. These services achieve idempotency by refusing a new message from a client that carries the same identifier as a previous one. Even with this kind of service, the client needs to generate a unique ID per message prior to submitting it.

It also doesn’t prevent a message from going back to an SQS queue when an error or timeout happens in the consumer.

Using a service with idempotency built in doesn’t remove the need to take duplicates into account in your own code.
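For illustration, publishing to an SQS FIFO queue with a client-generated deduplication ID looks roughly like this; the queue URL and IDs are placeholders:

```typescript
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});

// Duplicate sends with the same MessageDeduplicationId within the
// 5-minute deduplication window are accepted but not delivered again.
export async function publishOrderCreated(orderId: number): Promise<void> {
  await sqs.send(new SendMessageCommand({
    QueueUrl: 'https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo', // placeholder
    MessageGroupId: 'orders',                   // ordering scope
    MessageDeduplicationId: `order-${orderId}`, // unique per logical message
    MessageBody: JSON.stringify({ orderId }),
  }));
}
```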

Traceability

Each building block of an Event-Driven Application publishes logs for its own processes, so tracing a request from start to end becomes a challenge.

To ease this burden, the logs should be collected by a single service. This can simply be CloudWatch Logs or any third-party tool.

A request flow should have a unique identifier, or correlationId, passed to each subtask. This identifier can be created by the initial sender or by the first service initiating the flow.
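A sketch of what propagating such an identifier can look like in a handler; the event shape and log format are illustrative:

```typescript
import { randomUUID } from 'node:crypto';

interface AppEvent {
  correlationId: string;
  detail: unknown;
}

// Reuse the caller's id when present, otherwise mint one at the edge,
// and attach it to every log line and to the outgoing event.
export const handler = async (event: Partial<AppEvent>): Promise<AppEvent> => {
  const correlationId = event.correlationId ?? randomUUID();
  const log = (message: string, extra: object = {}) =>
    console.log(JSON.stringify({ correlationId, message, ...extra }));

  log('processing started');
  // ... do the work, then emit the next event carrying the same id:
  const outgoing: AppEvent = { correlationId, detail: { status: 'done' } };
  log('processing finished');
  return outgoing;
};
```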

AWS X-Ray is a service that lets you correlate traces across your different services and also gives you insights into the AWS services used along the way, such as connection times to DynamoDB.

With CloudWatch Logs Insights, you can query the aggregated logs from all your services.

Data sources

When considering independent events, an important point to consider is how the data needed by a subtask is transferred by the producer. There are three common approaches:

  1. All the data from the producer is available in the event payload.
     This reduces the number of requests to a database or other storage, but the event can become very heavy, sometimes exceeding the maximum size imposed by the broker (256 KB for EventBridge, SNS, and SQS, for example). It also becomes an issue when multiple services are chained, with each consumer adding more data to the payload for the next one. This kind of transfer is useful when the amount of data is reasonable and the producer knows which data the consumers need.
  2. Minimal information is sent, and the producer provides endpoints for consumers to fetch more data.
     This reduces the event payload size, but it requires the producer to expose endpoints so consumers can fetch the data they need, adding strain on the underlying storage or databases. It is the producer’s burden to provide the right level of caching to improve performance.
  3. Data is cached by the consumer.
     A consumer can cache data from the producer, either by syncing the data on a schedule or by keeping it “hot” for a defined time after the first request.
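For illustration, here is a sketch contrasting the first two approaches; all the types and the endpoint are made up:

```typescript
// 1. Fat event: everything the consumers need travels in the payload.
interface OrderCreatedFat {
  type: 'OrderCreated';
  order: { id: string; items: unknown[]; customer: unknown; history: unknown[] };
}

// 2. Minimal payload plus a pointer; consumers that need more fetch it
//    from an endpoint the producer maintains.
interface OrderCreatedThin {
  type: 'OrderCreated';
  orderId: string;
  detailUrl: string; // e.g. https://api.example.com/orders/{orderId}
}

async function enrich(event: OrderCreatedThin): Promise<unknown> {
  const response = await fetch(event.detailUrl); // producer-provided endpoint
  return response.json();
}
```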