You want to add search functionality to your application, but the storage you are currently using doesn't offer an easy, out-of-the-box way to do it.
You might be tempted to reach for Elasticsearch, Amazon OpenSearch Service, or another managed search service, and that is often the right choice.
But what if your needs don't require the heavy lifting and the cost of a full-blown search infrastructure? Why not use a serverless search solution?
I've had my eye on LunrJS for a while as a way to provide search capabilities on a static website. LunrJS is primarily made for browser integration, but since it also works in Node.js, why not wrap it in a Lambda function and expose search as an API?
Pre-Build and Share the Indexes
Indexing the data is the most time-consuming part of the operation, so instead of building the index on every search request, we pre-build it and store it in shared storage.
We will use Amazon S3 for this. Amazon EFS would offer faster retrieval, but since we cache the index in memory, we only pay the penalty on the first load.
Memory Limits
The index is loaded into memory, but a search query only returns the matching documents' identifiers.
To return all the fields of the resulting documents, they have to be fetched from the source, which is also loaded into memory.
Since a Lambda function can have at most 10 GB of memory, this solution isn't for you if your index and documents combined exceed that limit.
You could instead stream the documents from storage on every request, but that comes with a latency penalty.
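Joining results back to full documents can be as simple as keeping a ref-to-document map in memory. This is a sketch with a fake result set; `docsById` stands in for the documents loaded from the source.

```javascript
// lunr results carry only { ref, score, ... }, so we keep a
// ref -> document map in memory and join results back to full documents.
function hydrate(results, docsById) {
  return results.map(({ ref, score }) => ({ score, ...docsById[ref] }));
}

// Example with a fake result set:
const docsById = {
  42: { id: '42', title: 'E.T. the Extra-Terrestrial' },
};
const hydrated = hydrate([{ ref: '42', score: 1.5 }], docsById);
```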
The Solution
The application is split into two independent parts: indexing and searching.
Full Re-Index
Creating an index is a full re-index: you scan the entire DynamoDB table and re-index every document. This can become costly if the process is triggered on every change. You can mitigate this by running the indexing on a schedule, at the cost of leaving your source data and search results out of sync for a period.
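The scan loop can be sketched like this. `scanPage` is a stand-in for a real DynamoDB `Scan` call paginated with `ExclusiveStartKey`/`LastEvaluatedKey`; inject the actual AWS SDK call in the Lambda.

```javascript
// Full re-index: pull every page of the table, then rebuild the index.
async function scanAll(scanPage) {
  const items = [];
  let lastKey;
  do {
    // scanPage(lastKey) represents a Scan with ExclusiveStartKey = lastKey.
    const page = await scanPage(lastKey);
    items.push(...page.Items);
    lastKey = page.LastEvaluatedKey; // undefined on the last page
  } while (lastKey);
  return items; // feed these into the lunr builder
}
```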
Partial Indexing
With the lunr-mutable-indexes extension, we can listen to DynamoDB Streams and update the index on every row change, without re-indexing the whole dataset.
Indexes generated with lunr-mutable-indexes are slightly bigger, but they are directly usable by lunr.
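The stream-handling side can be sketched as a dispatch over the record types. Here `index` is assumed to be a lunr-mutable-indexes instance exposing `add`/`update`/`remove`, and `unmarshall` converts DynamoDB attribute maps to plain objects (e.g. from `@aws-sdk/util-dynamodb`).

```javascript
// Translate DynamoDB Streams records into mutable-index operations
// instead of triggering a full re-index.
function applyStreamRecords(index, records, unmarshall) {
  for (const { eventName, dynamodb } of records) {
    if (eventName === 'INSERT') index.add(unmarshall(dynamodb.NewImage));
    else if (eventName === 'MODIFY') index.update(unmarshall(dynamodb.NewImage));
    else if (eventName === 'REMOVE') index.remove(unmarshall(dynamodb.OldImage));
  }
}
```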
Let's build it
We will use the Serverless Framework (serverless.com) to build our application. The source code is available on GitHub.
The searchable data comes from CSV source files uploaded to a bucket.
S3 Bucket to store sources and indexes
Using CloudFormation, we provision a bucket and enable EventBridge notifications so we can listen for new incoming file events.
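The bucket resource might look like this in serverless.yml (the resource and bucket names are illustrative); `EventBridgeEnabled` is what turns the `s3:ObjectCreated` events into EventBridge events we can subscribe to:

```yaml
resources:
  Resources:
    SearchBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: !Sub "${AWS::StackName}-search"
        NotificationConfiguration:
          EventBridgeConfiguration:
            EventBridgeEnabled: true
```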
To reduce storage access and improve latency, the index and the documents are cached in memory outside the handler, to be re-used on subsequent invokes.
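The warm-start cache boils down to module scope, which survives between invokes of the same Lambda instance. In this sketch, `fetchIndexAndDocs` stands in for the real S3 `GetObject` calls.

```javascript
let cache = null; // module scope, outside the handler: survives warm invokes

async function getIndexAndDocs(fetchIndexAndDocs) {
  if (!cache) {
    cache = await fetchIndexAndDocs(); // cold start: pay the load once
  }
  return cache; // warm invokes: served straight from memory
}
```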
Search Lego sets produced in 1984 in the Duplo theme
• query: 'search=+year:1984 +theme:duplo'
• number of results: 17
• loading index and documents: 1927 ms (only on first load)
• query duration: 10 ms
• memory used: 562 MB
Search Movies with Ryan Reynolds
• query: 'search=+ryan +reynolds'
• number of results: 37
• loading index and documents: 3924 ms (only on first load)
• query duration: 5 ms
• memory used: 690 MB
Search Movies with the word "extra-terrestrial" in the synopsis
• query: 'search=overview:extra-terrestrial'
• number of results: 17
• loading index and documents: 3411 ms (only on first load)
• query duration: 4 ms
• memory used: 690 MB
As we can see, loading the index is slow, but still twice as fast as building it. Requests served by an already-warm function have no latency penalty.
Cost analysis
Let's consider the movie dataset and make the following assumptions:
• Source file updated daily (30 times a month)
• 1M search requests per month
  • 25% hit a cold function: 250,000
  • 75% re-use an already loaded Lambda: 750,000
| Item | Detail | Volume | Monthly Cost |
| --- | --- | --- | --- |
| S3 Storage | CSV + Index + Docs | 53.2 MB | $0.0012 |
| S3 PUT | Daily CSV + Index + Docs | 90 | $0.0005 |
| S3 GET | 250k Indexes + 250k Docs | 500,000 | $0.2150 |
| Lambda Indexing | 2048 MB memory, 10 s | 30 | $0.01 |
| Lambda Search Cold | 1024 MB memory, 5 s | 250,000 | $16.72 |
| Lambda Search Hot | 1024 MB memory, 0.5 s | 750,000 | $1.40 |
| Total | | | $18.13 |
Comparing this to other available solutions from AWS:
At Serverless Guru, we're a collective of proactive solution finders. We prioritize genuineness, forward-thinking vision, and above all, we commit to diligently serving our members each and every day.