API Gateway+Firehose+S3+Athena: For Data Storing & Analysis

October 2, 2023

Introduction

Nowadays we don't just want data, we want information, which is quite different. Data is like rough stones ⛏️ extracted from mining, while information is the polished diamond 💎✨.

This article aims to help you prepare a simple and effective stack where you can store data (rough stones) and then carry out analyses (extracting diamonds) in a fully scalable, low-cost way 💸.

The Pattern

The “AWS Lego”

When it comes to data analysis, AWS offers many sorts of tools; like a “Lego” with infinite possibilities. Today, I’m showing a pattern created and recommended by a friend of mine, Marcelo Andrade. In it, we will have a very simple API feeding a columnar database (a very performant way of doing data analysis), with everything saved into an S3 bucket.

Which “Lego” Pieces Will We Be Using?

  • API Gateway (to create our API)
  • S3 (where we will store the data)
  • Glue Database (to organize the data in S3 in a structured way)
  • Kinesis Firehose (the delivery stream that connects API Gateway to the Glue database + S3)
  • Lambda (a small JavaScript function that guarantees the partitioning)
  • Athena (for data analysis)

Finally, The Pattern!

How about understanding this pattern with a real-world example? Below we have an API to store logs, from which you will later be able to extract metrics (without using CloudWatch 😀).

  • Any application can send logs to the API; (rough stones ⛏️🪨)
  • A POST endpoint on API Gateway receives the logs as JSON
  • API Gateway acts as a service proxy; in other words, it communicates directly with Kinesis Firehose, without the need for a Lambda
  • Kinesis Firehose converts this “raw” JSON to a columnar format (Apache Parquet, a performant format for data analysis)
  • Kinesis Firehose then uses the Glue Database schema and saves the converted Parquet files in directories (a.k.a. partitions) inside an S3 bucket
  • A user can consume this partitioned information saved in S3 through SQL queries in Athena; (polished diamond 💎✨)
  • This information can also be consumed later by other processes, not only by users on the Athena console.

Real-world applications

  • Storage of metrics, logs, and traces for later analysis, which can replace CloudWatch and thus reduce costs
  • For example, you can use the API to feed sales data from your e-commerce in real time, make statistical predictions of profit and churn, and identify anomalies.
  • Data on sales, finance, the stock market, supply chain, health, performance, telecommunications, etc.
  • Any data that requires complex analytical queries or aggregations across multiple dimensions, which is common in large-scale data analysis environments.

Show me the code!

You can check out the pattern in this repository; here are some highlights of its most important points:

1. Our bucket to save logs - 'resources/s3.yml':
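
A minimal sketch of what this file could contain (the bucket name is an assumption; pick a globally unique one):

```yaml
Resources:
  LogsBucket:
    Type: AWS::S3::Bucket
    Properties:
      # Illustrative name; this bucket will hold the Parquet log files
      BucketName: recipe-apigw-athena-logs
```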

2. Glue Database - 'resources/glue.yml':

  • in 'InputFormat' and 'OutputFormat' we specify that we want the Parquet format (the columnar format mentioned earlier);
  • in 'TableInput > StorageDescriptor > Columns' we specify the database schema;
  • in 'TableInput > PartitionKeys' we specify that the database is partitioned by year, month, and day.
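
Putting those highlights together, a sketch of what 'resources/glue.yml' might look like (the columns follow the log JSON used later in this article; database and table names are assumptions):

```yaml
Resources:
  GlueDatabase:
    Type: AWS::Glue::Database
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseInput:
        Name: logs_glue_database   # assumed name

  GlueTable:
    Type: AWS::Glue::Table
    Properties:
      CatalogId: !Ref AWS::AccountId
      DatabaseName: !Ref GlueDatabase
      TableInput:
        Name: logs                 # assumed name
        TableType: EXTERNAL_TABLE
        # Partitioned by year, month, and day
        PartitionKeys:
          - { Name: year, Type: string }
          - { Name: month, Type: string }
          - { Name: day, Type: string }
        StorageDescriptor:
          Location: !Sub 's3://${LogsBucket}/'
          # Parquet input/output formats (the columnar format)
          InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
          SerdeInfo:
            SerializationLibrary: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
          # The database schema
          Columns:
            - { Name: source, Type: string }
            - { Name: logtype, Type: string }
            - { Name: timestamp, Type: string }
            - { Name: message, Type: string }
```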

3. Create a Stream with Kinesis Firehose and connect with Glue Database and S3:

  • in 'KinesisDeliveryStream > Properties' there is a property called 'ExtendedS3DestinationConfiguration', which means that the result of this stream will be stored in an S3 bucket (we reference this bucket in 'BucketARN');
  • in 'DataFormatConversionConfiguration > InputFormatConfiguration' we are specifying that we receive data in JSON (OpenXJsonSerDe);
  • in 'DataFormatConversionConfiguration > OutputFormatConfiguration' we are specifying that the Stream converts the data into Parquet (ParquetSerDe);
  • in 'DataFormatConversionConfiguration > SchemaConfiguration' we reference the Glue Database that we created earlier.
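
A sketch of how that stream definition might look (the IAM role is assumed and omitted for brevity, and the 'Prefix' shows one common way to produce the year/month/day directories):

```yaml
Resources:
  KinesisDeliveryStream:
    Type: AWS::KinesisFirehose::DeliveryStream
    Properties:
      DeliveryStreamType: DirectPut
      ExtendedS3DestinationConfiguration:
        # The stream output lands in our logs bucket
        BucketARN: !GetAtt LogsBucket.Arn
        RoleARN: !GetAtt FirehoseRole.Arn   # assumed IAM role, not shown here
        # Write objects into year/month/day directories (partitions)
        Prefix: 'year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/'
        ErrorOutputPrefix: 'errors/'
        BufferingHints:
          IntervalInSeconds: 60
          SizeInMBs: 64                     # format conversion requires at least 64 MB
        DataFormatConversionConfiguration:
          Enabled: true
          InputFormatConfiguration:
            Deserializer:
              OpenXJsonSerDe: {}            # we receive JSON
          OutputFormatConfiguration:
            Serializer:
              ParquetSerDe: {}              # we write Parquet
          SchemaConfiguration:
            DatabaseName: !Ref GlueDatabase # the Glue database created earlier
            TableName: !Ref GlueTable
            RoleARN: !GetAtt FirehoseRole.Arn
```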

4. Now we want an endpoint on API Gateway acting as a service proxy for Kinesis. The trick here is to use the 'serverless-apigateway-service-proxy' plugin; just include it in 'serverless.yml':

  • in 'streamName' we reference the stream created previously; with that, the plugin creates the endpoint in API Gateway on its own.
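
A sketch based on the plugin's kinesis proxy configuration shape (the path matches the curl example later in this article; check the repository for the exact integration used):

```yaml
plugins:
  - serverless-apigateway-service-proxy

custom:
  apiGatewayServiceProxies:
    - kinesis:
        path: /recipe-apigw-athena
        method: post
        # Reference to the stream created earlier
        streamName: { Ref: KinesisDeliveryStream }
        cors: true
```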

5. Let's configure Athena so we can run queries on the mass of data stored in the S3 bucket. The trick here is to include the 'serverless-athena' plugin in 'serverless.yml':

  • in 'databases > name' we must be careful not to use the same name as the Glue Database;
  • in 'databases > ddl' we specify a SQL file* with a 'CREATE DATABASE' statement for Athena;
  • in 'databases > tables > ddl' we specify a SQL file* with a 'CREATE TABLE' statement with the same specifications that we used earlier in Glue Database.

*you can check these files here: repository
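
Pulling those keys together, the Athena section of 'serverless.yml' might be shaped roughly like this (the structure is inferred from the bullets above and the file paths are assumptions, so treat it as a sketch):

```yaml
plugins:
  - serverless-athena

custom:
  athena:
    databases:
      # Careful: must NOT be the same name as the Glue database
      - name: logs_athena_database
        ddl: athena/ddl/database.sql    # contains the CREATE DATABASE statement
        tables:
          - name: logs
            ddl: athena/ddl/table.sql   # CREATE TABLE matching the Glue schema
```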

6. Finally, we will guarantee that the Athena partitions are updated each time a file is submitted through the API and saved in Parquet format in the S3 bucket. For this, we create a Lambda that is triggered on every 's3:ObjectCreated' event. Add to 'serverless.yml':
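
A sketch of the function definition with its S3 trigger (the bucket name matches the earlier assumption):

```yaml
functions:
  partitionUpdate:
    handler: athena/handlers/partitionUpdate.handler
    events:
      - s3:
          bucket: recipe-apigw-athena-logs   # the logs bucket created earlier
          event: s3:ObjectCreated:*
          existing: true                     # the bucket is defined in resources, not by this event
```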

7. Here is the Lambda handler - 'athena/handlers/partitionUpdate.js':
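
The real handler lives in the repository; here is a minimal sketch of the idea, assuming AWS SDK v3 and illustrative table, database, and output-bucket names: it parses the partition values from the S3 key and registers the partition with Athena.

```javascript
// athena/handlers/partitionUpdate.js (sketch)
const { AthenaClient, StartQueryExecutionCommand } = require('@aws-sdk/client-athena');

const athena = new AthenaClient({});

module.exports.handler = async (event) => {
  for (const record of event.Records) {
    // Keys look like "year=2023/month=10/day=02/some-file.parquet"
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
    const [year, month, day] = key
      .split('/')
      .slice(0, 3)
      .map((part) => part.split('=')[1]);

    // Tell Athena about the new partition so queries can see the fresh data
    await athena.send(new StartQueryExecutionCommand({
      QueryString: `ALTER TABLE logs ADD IF NOT EXISTS PARTITION (year='${year}', month='${month}', day='${day}')`,
      QueryExecutionContext: { Database: 'logs_athena_database' },      // assumed name
      ResultConfiguration: { OutputLocation: 's3://my-athena-results/' } // assumed bucket
    }));
  }
};
```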

Let's play with our Logs API 😀!

1. 'npm run deploy' or 'npx serverless deploy --stage dev --region us-east-1'

2. Wait for deployment, get the output URL from API Gateway, and do a POST request with a Log in JSON format:

  
```bash
curl -X POST https://{yourUrlAfterDeployHere}/dev/recipe-apigw-athena \
  -H "Content-Type: application/json" \
  -d '{ "source": "financialSystem", "logtype": "error", "timestamp": "1694209417035", "message": "ooops, something went wrong" }'
```
  

3. Query in Athena searching for the number of results by logtype (error, warning, info, etc.)
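
Counting results per logtype while filtering on the partition columns could look like this (table name and dates are illustrative):

```sql
SELECT logtype, COUNT(*) AS total
FROM logs
WHERE year = '2023' AND month = '10' AND day = '02'
GROUP BY logtype
ORDER BY total DESC;
```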

It is important to note that we benefit from filtering by the attributes used to partition the data (year, month, and day). With massive amounts of data, this has a great impact on both performance and cost, since Athena charges per amount of data scanned.

Now just plug the API into your application and use it at will: all resources will scale automatically (yeah, that’s serverless power! 😀), you get very high storage capacity at low cost (S3), and you also get very performant data analysis at low cost (Athena).

Conclusion

There are many ways to do data storage and analysis, but doing it through an API is very convenient, and this pattern will save a lot of money by storing data in S3 and paying only for the data read through partitions with Athena.

With the AWS Pricing Calculator, I estimated 5 GB of logs read and 3 GB of data analyzed. While using CloudWatch it would cost 5.06 USD/month, with this pattern it costs 1.78 USD/month, almost 3 times cheaper.

This pattern supports large amounts of data with no performance or scalability gaps. You could also input data from other sources using Glue, and there are many possibilities for triggering other processing from Athena query results, such as ETLs.

Sometimes you are in a low-budget situation, where every coin counts. As shown with the Logs API example, many times you don’t need all the features of CloudWatch, only a few pieces for a simple solution. Patterns like this solve many real-world cases with less code.

If, like me, you are not a data analytics expert, don’t worry: I think this pattern can be a good starting point, so feel free to contact me. My special thanks to Marcelo Andrade for this pattern; it was an eye-opener for me.
