
I have recently deployed a Django backend application to AWS Lambda using Zappa.

After the lambda function has not been invoked for some time, the first request takes 10 to 15 seconds to be processed. At first I thought it was a cold start, but even for a cold start that is unacceptably long. Then, reading through Zappa's documentation, I saw that it enables the keep_warm feature by default, which sends a dummy request to the lambda function every 4 minutes to keep it warm; so this excessive delay in the response to the first request cannot be due to a cold start.
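For reference, Zappa's keep-warm behaviour is configured in zappa_settings.json; a minimal sketch (the project name, bucket, and stage are placeholders; keep_warm defaults to true, and the 4-minute rate shown is Zappa's default):

```json
{
    "production": {
        "django_settings": "myproject.settings",
        "s3_bucket": "my-zappa-bucket",
        "keep_warm": true,
        "keep_warm_expression": "rate(4 minutes)"
    }
}
```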

Then I started using tools such as AWS X-Ray and CloudWatch Insights to try to find the explanation for the delay. Here is what I found out:

The invocation that takes a very long time to be processed is the following:

[screenshot: log output of the slow invocation]

Crossed out in red are the names of the environment variables the application uses. They are all defined and assigned values directly in the AWS Console. What I don't understand is, first of all, why this invocation takes so long, and secondly, why it says the environment variables are cast as None. The application works perfectly (apart from the massive delay in the first request), so the environment variables are evidently set correctly somewhere.
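For context, the settings module reads these variables in the standard way (the variable names below are hypothetical); os.environ.get simply returns None for any variable that is absent from the environment in which that line executes, which may be why the trace shows them as None:

```python
import os

# settings.py (sketch) -- hypothetical variable names.
# os.environ.get returns None when the variable is missing from the
# process environment at the moment this module is imported.
DB_NAME = os.environ.get("DB_NAME")
DB_PASSWORD = os.environ.get("DB_PASSWORD")
SECRET_KEY = os.environ.get("DJANGO_SECRET_KEY")
```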

This request is made religiously every two hours, and also the first time someone invokes the lambda function after a period of inactivity, as seen in the following chart:

[chart: invocation durations over time]

The dots along the x-axis correspond to Zappa's dummy requests that keep the server warm. The elevated dots correspond to the invocation shown in the previous image. Finally, the spike corresponds to a user invocation. The time it took to process is the sum of the time taken by the long invocation (the one shown in the first image) and the time taken by the longest HTTP request the client makes to the server. That request was the following:

[screenshot: trace of the login request]

It was a regular login request that should be resolved much faster. Other requests that are probably more demanding than this one were resolved in less than 100ms.

So, to sum up:

  1. There is one lambda invocation that takes more than 10 seconds to be resolved. This corresponds to the first image shown. It is done every 2 hours and when a user makes a request to the server after it has been idle for some time.
  2. Some requests take more than 2 seconds to be resolved and I have no idea as to why this could be.
  3. Apart from these previous function invocations, all other requests are resolved in a reasonable time frame.

Any ideas as to why these invocations could be taking so much time are very much appreciated, as I have spent quite some time trying to figure this out on my own and have run out of ideas. Thank you in advance!

Edit 1 (28/07/21): to further support my suspicion that this delay is not due to a cold start, here is the "Segments Timeline" of the function in CloudWatch / Application monitoring / Traces:

[screenshot: Segments Timeline of the function]

If it were a cold start, the delay should appear in the "Initialization" segment and not in the "Invocation" one.
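For reference, a cold start would also be flagged in the function's CloudWatch REPORT log line by an extra Init Duration field at the end; the line below is illustrative of the format, not taken from my logs:

```
REPORT RequestId: 1a2b3c4d-5678-90ab-cdef-1234567890ab  Duration: 512.34 ms  Billed Duration: 513 ms  Memory Size: 512 MB  Max Memory Used: 175 MB  Init Duration: 487.21 ms
```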

Edit 2 (30/07/21): I forgot to mention that I had previously deployed the application using Elastic Beanstalk and didn't face this problem whatsoever, so my code's performance is probably not the problem here.

Edit 3 (30/07/21): I found this thread in an AWS forum from 2016 regarding this exact issue. An AWS engineer mentioned that this behaviour is not by any means expected for a Lambda function outside of a VPC (like mine). Nevertheless, no answer was provided that explained the cause of the 10-15 seconds delay.

Edit 4 (03/08/21): I tried doubling the function's assigned memory (from 512 MB to 1024 MB) but it did not help.

Zufra
  • Is it not an option to use 1 minute as keep warm? – Brian Destura Jul 28 '21 at 06:29
  • Yes, but as I mentioned in the question that huge delay can't be a cold start. And it happens every two hours even if the function is not invoked. – Zufra Jul 28 '21 at 18:54
  • I looked again at the docs. Since AWS manages the resources to be used when the lambda cold starts, that delay can be a few hundred ms to a few seconds. So it seems we need to expect that it can happen if we use less frequent keep warms :( – Brian Destura Jul 28 '21 at 23:40
  • Hmm, so do you think changing the warmup to 1 minute could help? I will try that and update the question with the result. However, I still think this is not a cold start. I added another piece of evidence to support this train of thought. – Zufra Jul 28 '21 at 23:46
  • From the logs and everything else you are showing, it does not seem to be a cold start issue. I haven't encountered cold starts that long since they've changed their approach, and even then these durations would only happen for lambdas running inside a VPC. I presume it must be some code execution, and not necessarily lambda itself. This can be your code or the framework of course. I doubt changing the keep-warm interval will help much. – stijndepestel Jul 30 '21 at 12:34
  • Have you tried upping the memory? – Derrops Aug 03 '21 at 06:48
  • The lambda function has assigned the default 512 MB and the invocation in question says it used a maximum of 175 MB, as seen in the first image included in the question. Nevertheless, I will try doubling the memory but I doubt it will help. – Zufra Aug 03 '21 at 16:23
  • What database are you using? Not sure if Zappa warmer is keeping your database connection active. – Satya Mishra Aug 05 '21 at 23:10
  • I am using Postgres within the AWS RDS service. How can I know if the database connection is what is taking so long? – Zufra Aug 07 '21 at 01:48
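On the database-connection theory in the last two comments: Django's CONN_MAX_AGE setting controls whether connections persist between requests served by the same warm container, so toggling it is one cheap test. A minimal sketch, with placeholder credentials; note that the very first request in a fresh container still pays the connection cost either way:

```python
# settings.py (sketch) -- placeholder credentials throughout.
# Django's default CONN_MAX_AGE of 0 opens a new Postgres connection
# on every request; a positive value reuses the connection for that
# many seconds while the same warm Lambda container serves requests.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",
        "USER": "myuser",
        "PASSWORD": "mypassword",
        "HOST": "my-db.abc123.us-east-1.rds.amazonaws.com",
        "PORT": "5432",
        "CONN_MAX_AGE": 60,  # seconds; 0 disables persistence
    }
}
```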

2 Answers


I have also added some comments to the question to explain that this is probably not due to a cold start. As you rightly stated, cold starts are explicitly indicated and seem to only take about 500 ms in your case.

Cold starts this long usually only manifested themselves when lambdas were run inside a VPC, and AWS has since changed the way lambdas get their network interface, which has dramatically sped up that process.

That being said, a quick Google search led me to some interesting discussions on other sites about Django applications and lazy loading. I'll share some links here (even though they are not related to Lambda) in the hope they can help you find a solution:

https://community.webfaction.com/questions/11560/django-app-seems-very-slow-to-start-up-10-seconds
https://ses4j.github.io/2015/11/23/optimizing-slow-django-rest-framework-performance/

As a last note about keep_warm: sending those requests is quite an old trick in the book. However, be aware that there are no guarantees as to how long AWS keeps a lambda warm. If an Init Duration is indicated in the logs, you can be sure that invocation was a cold start. If you need to ensure that a lambda function is warm and quick to respond to incoming requests, you'll have to use provisioned concurrency, which of course has its own price tag.
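For completeness, here is a minimal boto3 sketch of what enabling provisioned concurrency looks like (the function name and alias are placeholders; the same can be done from the console or CLI):

```python
import boto3

lambda_client = boto3.client("lambda")

# Keeps one execution environment initialised at all times.
# Provisioned concurrency must target a published version or an
# alias, not $LATEST; "prod" is a placeholder alias here.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="my-zappa-function",
    Qualifier="prod",
    ProvisionedConcurrentExecutions=1,
)
```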

stijndepestel
  • Hello @stijndepestel, thanks for your answer. I'll look into the links you provided and let you know if they helped! – Zufra Jul 30 '21 at 13:24
  • The first link suggests a specific configuration of the httpd.conf file of an Apache web server. As I am using Lambda, I don't think it is possible to control that. The second one is about eager loading using the Django Rest Framework. I added a second edit to my question to explain why my code is not the cause of this problem. Thank you for your answer anyway! – Zufra Jul 30 '21 at 13:40
  • Not to point fingers, but it is a bold claim to say that it isn't your code that could cause this problem. If you compare to Elastic Beanstalk (which is just an EC2 server), you will have been running on much "stronger" hardware than you are with lambda. You could try to adjust the memory settings and see how that affects things. From the logs you posted, it is also clear that (for both slow requests) the time is spent while executing your code. So it's presumably either your code or a downstream service you depend upon that is causing the delay, and not the lambda service itself. – stijndepestel Jul 30 '21 at 13:57
  • You are right, I added the keyword "probably" to my question. Nevertheless, the EC2 instance used in my Elastic Beanstalk deployment was very basic (just 1 GB of RAM) and there were no delays. Furthermore, reading that several people have experienced exactly the same problem reinforced the idea that the problem does not lie in the code itself. – Zufra Jul 30 '21 at 14:01

I can see some suggestions here on trying to increase the memory for your lambda (and I also saw that you tried going from 512 to 1024). Have you tried increasing it further, say to about 3072? It's a significant increase, but the point is simply to prove first that the problem is not due to resource limitations.

The keep_warm feature isn't guaranteed as far as I've seen, and the bulk of the (cold) start time is due to initialisation. Since the vCPU allocated to a lambda is proportional to the memory you assign to it, your lambda may initialise quicker and somehow mitigate these cold starts.
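A quick way to run that experiment without a redeploy, sketched with boto3 (the function name is a placeholder; with Zappa you can instead set memory_size in zappa_settings.json and redeploy):

```python
import boto3

lambda_client = boto3.client("lambda")

# Temporarily raise the memory limit. The CPU share scales with
# memory, so this tests whether the slow invocations are compute-bound.
lambda_client.update_function_configuration(
    FunctionName="my-zappa-function",  # placeholder
    MemorySize=3072,                   # in MB
)
```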

  • I tried increasing the memory to 1024 MB, and the long request that previously took 13 seconds went down to 7 seconds. That is still unacceptable, which is why I said it did not help. What baffles me is how the increase in memory can have anything to do with this when the invocation report says every time that a maximum of 173 MB was used. It is not viable to double the memory allocated to the function (or multiply it by six, as you suggest), as the price goes up proportionally. – Zufra Aug 10 '21 at 20:50
  • Understood. The value of 3072 was simply an arbitrarily large value to quickly (though temporarily) see if an increase would help. I believe you've already seen an improvement even by just bumping it to 1024, so the bottleneck here is likely due to the lambda's current resource constraints. Increasing memory also increases the vCPU allocation proportionally - so the extra compute power could be reducing the time. – fptiangco Aug 13 '21 at 15:47
  • Maybe it is CPU related, but I still don't understand why only one request is affected while all subsequent requests are answered in a reasonable time. And the amount of time that request takes is preposterous; the code is not CPU intensive by any means, just simple logic to get data from the database and return it. – Zufra Aug 14 '21 at 16:09