Serverless Troubleshooting: EMFILE Issues

March 22, 2024

Introduction

When a process that makes HTTP calls or does file processing runs on any OS, there is a limit that can break the process with an error report like “EMFILE bla bla bla”, where the “bla bla bla” can be any of the following reasons:

  • Exceeded socket connections
  • Address lookup errors
  • Too many open files

Exceeded socket connections and address lookup errors are the most common EMFILE issues, and they succeed in making us look like losers. They happen when we make hundreds of HTTP calls to API endpoints all at once in AWS Lambda. The issue is hard to reproduce locally, something I had been facing for months. One idea to make things easier was to replace the Lambda with a Step Function so we could break down the endpoint calls, but there were reasonable arguments from both the business and architectural levels to keep going with Lambda.

Well, lemme explain what exactly is going on here from a practical point of view (I could be wrong theoretically). We have one Lambda function that connects to 11 services in parallel, including DynamoDB queries, async Lambda invocations, and API Gateway requests. We ran a performance test at hundreds of transactions per second (for example, 300 TPS). So we can surmise that Lambda will make around 3,300 requests at the same time (300 TPS x 11 services = 3,300 requests). That means Lambda can have 3,300 socket connections open all at once.

What’s wrong with Lambda holding 3,300 socket connections at the same time? Nothing, but AWS imposes a limit on the maximum number of open connections. I couldn’t find any documentation that states a maximum socket limit for Lambda, but the Lambda quotas page does document a file descriptor limit of 1,024.

I am not confident that socket connections count against the file descriptor limit, since file descriptors seem more related to the “too many open files” flavor of EMFILE that shows up during file processing, while our main issue is socket connections. I ran several rounds of tests to find a safe limit for concurrent open socket connections in Lambda; according to my tests, up to 500 connections is safe. We also added a small helper to count the file descriptors in one Lambda run, and it showed a small number, under 50. So it seems file descriptors are not related to the EMFILE socket issue. If I’m wrong, please let me know which part you disagree with.
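
For context, here is a minimal sketch of the kind of descriptor counter we used (my reconstruction, not our exact code). It assumes the runtime exposes /proc/self/fd, which holds for Lambda since it runs on Amazon Linux:

const fs = require('fs');

// On Linux, every open file descriptor (files and sockets alike)
// shows up as an entry under /proc/self/fd.
const countOpenFds = () => fs.readdirSync('/proc/self/fd').length;

console.log(`Open file descriptors: ${countOpenFds()}`);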

Once I found the root cause, I embarked on a journey to make my way out of the EMFILE nightmare.

Batch the Process

We divided the event records into multiple batches: around 10 batches for 300 records, so each batch processed 30 records (see the sketch below). Logically it should have worked, especially since it stayed below the safe limit of 500 (30 records x 11 services = 330 requests), but it did not work how I expected. The problem, I assume, is that when the process moves to the next batch, sockets from the previous batch are still trying to close after completing their requests; they do not close immediately. What if we increase the number of batches to 30? Unfortunately, we hit the same issue. Strange, eh? I don’t have any idea why it still fails; do you?
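
For reference, here is a minimal sketch of the batching approach; chunkRecords and processRecord are hypothetical stand-ins for our real helpers:

// Split the records into fixed-size batches (hypothetical helper)
const chunkRecords = (records, size) =>
  Array.from({ length: Math.ceil(records.length / size) }, (_, i) =>
    records.slice(i * size, (i + 1) * size)
  );

// Process one batch at a time; records within a batch still run in parallel
const processInBatches = async (records, batchSize, processRecord) => {
  for (const batch of chunkRecords(records, batchSize)) {
    await Promise.all(batch.map(processRecord));
  }
};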

Sequence Flow

We switched the parallel requests to run sequentially to reduce the number of open socket slots, while still combining this with the batch approach. Say 30 records per batch flow through each service call, so each call opens 30 sockets. The first service in the sequence opens 30 sockets, and then the next service opens 30 new sockets while the previous 30 are still trying to close. I assume that from the second service call until the last, there are more or less 60 open sockets at any time.
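
Roughly, the sequential flow looked like the sketch below, where callService is a hypothetical wrapper around each of the 11 service calls:

// Walk through the services one at a time instead of firing all 11
// calls in parallel (callService is a hypothetical wrapper).
const processBatchSequentially = async (batch, services) => {
  for (const service of services) {
    // Each service still handles the whole batch in parallel,
    // so roughly batch.length sockets are open per step.
    await Promise.all(batch.map((record) => callService(service, record)));
  }
};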

Yeah, logically it should have worked, but the wall was still there: I could still see a number of EMFILE reports in the log.

API Cache

I heard this one didn’t make sense to test, but here is the idea: cache any API response whose request body we have already seen, and return the cached response for repeated requests instead of calling again. Some APIs are likely to receive identical request/response pairs, so caching them reduces the number of API requests.
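
Here is a minimal in-memory sketch of that idea, keyed on the serialized request body (url and body are placeholders, and this is my reconstruction rather than our production code):

const responseCache = new Map();

// Cache the response promise, so identical requests fired at the same
// time share one in-flight call instead of opening extra sockets.
const cachedPost = (url, body) => {
  const key = `${url}:${JSON.stringify(body)}`;
  if (!responseCache.has(key)) {
    const pending = fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    }).then((res) => res.json());
    responseCache.set(key, pending);
  }
  return responseCache.get(key);
};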

Yeah, I thought it was smart enough, but reality beat me again!

Switch Client Library

Normally, we just use the native fetch API in Node.js. It’s possible this client isn’t good enough to handle a huge number of socket connections, so let’s run another test with other clients, like Axios or the native https module. Yeah, we did that, and got the same result; not much help!
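
For completeness, the Axios version of a call looked roughly like this (axios must be bundled with the function; url and payload are placeholders):

const axios = require('axios');

// Same request as the fetch version, just through axios
const callEndpoint = async (url, payload) =>
  (await axios.post(url, payload, { timeout: 5000 })).data;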

Single Connection

I read this idea in a thread about the new AWS SDK throwing EMFILE errors: reuse the connection by keeping it alive. It’s a similar situation, since the SDK makes HTTP calls to service endpoints like DynamoDB, SQS, Lambda, etc.

So I ran a test with the https module, as it’s the easiest place to apply the keep-alive rule: I just passed Connection: 'keep-alive' in the request headers. Well, it gave me hope, seeing the number of EMFILE issues decrease, but in the end we need to ensure there are none.
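
That looked roughly like the sketch below; the host is a placeholder, and an https.Agent with keepAlive: true is the more idiomatic way to get socket reuse in Node.js:

const https = require('https');

// A shared keep-alive agent reuses sockets across requests
// instead of opening a brand-new connection for every call.
const agent = new https.Agent({ keepAlive: true });

const request = (path, body) =>
  new Promise((resolve, reject) => {
    const req = https.request(
      {
        host: 'api.example.com', // placeholder endpoint
        path,
        method: 'POST',
        agent,
        headers: { Connection: 'keep-alive' },
      },
      (res) => {
        let data = '';
        res.on('data', (chunk) => (data += chunk));
        res.on('end', () => resolve(data));
      }
    );
    req.on('error', reject);
    req.end(body);
  });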

I Found the Door

This kind of issue challenged me to find a way out, and finally, I found it.

From my point of view, we need to give the Lambda a break once it hits the safe connection limit. This gives the Lambda time to free up socket connections before opening new ones. You could argue this isn’t the best method, but I have valid reasons to argue back.

Ok, so the safe connection limit is 500. Knowing this, we let the Lambda hold the process once the number of open sockets reaches 500, until some of them are freed up. This is the logic flow:

  
// Safe limits, configurable via environment variables
const MAX_CONN_NUM = Number(process.env.MAX_CONN_NUM); // e.g. 500
const HOLD_SECOND = Number(process.env.HOLD_SECOND); // delay in milliseconds, e.g. 100

let connNum = 0; // number of sockets currently open

// Pause until the open connection count drops back below the limit
const holdConnection = async () => {
  while (connNum > MAX_CONN_NUM) {
    await new Promise((resolve) => setTimeout(resolve, HOLD_SECOND));
  }
};
  

There, we set MAX_CONN_NUM to the safe connection limit of 500.

HOLD_SECOND is the delay used to hold the process; despite the name, the value is in milliseconds.

In my last test, 100 ms gave the best results in performance and stability across the record volumes I tested.

I tested three load levels, 300, 400, and 500 TPS, and all of them worked well with a 100 ms HOLD_SECOND, 1,024 MB of Lambda memory, a 5-minute timeout, 512 MB of ephemeral storage, the Node.js 18 runtime, and the Graviton (arm64) architecture.

Implementing it on the request is very simple:

  
await holdConnection();
connNum++; // a new socket is about to open

const response = await fetch(url, options);
connNum--; // request completed, the socket slot is freed
  

You can see that connNum increases before the HTTP call, to indicate a new socket is opening, and decreases after the request completes. That lets holdConnection know whether the number of socket connections has reached the safe limit (MAX_CONN_NUM). If it rises above the limit, the process simply holds until one or more sockets are freed up. It’s as simple as that; I hope the idea helps!

WDYT? Reach out to me on LinkedIn

Conclusion

When we face a particular technical issue that isn’t well documented on any official platform or dev channel, it becomes a challenge to keep thinking positively while trying our hardest to resolve it. I am pretty sure the above solution doesn’t follow best practices, and I’ve been told that using one Lambda for multiple service calls is not a good choice, and that I should switch to Step Functions instead. But when you consider the business and architectural requirements that may rely on Lambda, I think this solution makes sense and is applicable.

So, is it fair to say that the best practice is not always the best choice? In my opinion, it depends on the use case and the issues you’re facing.

References

https://stackoverflow.com/questions/10355501/node-js-emfile-error-with-increasing-traffic

https://github.com/aws/aws-sdk-js-v3/issues/3279

https://github.com/samswen/lambda-emfiles

https://bahr.dev/2021/06/03/lambda-emfile/

https://repost.aws/questions/QURpfCWjifS3qEZb3A0K0j3w/lambda-file-descriptor-limits-does-it-include-network-connections

https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html

https://github.com/aws/aws-sdk-js-v3/issues/5273#issuecomment-1965013238
