In this next blog post of the MicroServerlessLamba.NET series, we review how we made our microservice calls more resilient.
As previously covered, hosting via functions as a service is not a silver bullet and has its own set of challenges. AWS Lambda is no exception. Functions scale easily with elastic loads, and cold starts are usually not an issue when the caller can tolerate some delay in the response. When using microservices to drive a UI for active users, however, it’s a different story. We can’t expect to retain engaged and delighted users if they have to wait up to 30 seconds for a cold Lambda to start up. Web app users expect sub-second responses for a snappy user interface.
Warming Lambdas
We tried warming Lambdas, and that was incrementally better. It succeeded in keeping a set number of Lambdas warm over time. The problem was that the warming process itself blocked real service calls from executing. In addition, the more Lambdas you needed warm, the longer the warming process took.
We used this process for a while, but it didn’t scale as well as we would have liked, and we needed a better solution.
Fault Resilience
Initially we started with only REST services and used AutoRest to generate service clients from Swagger documentation. We didn’t realize it at the time, but AutoRest performs retries by default in its generated clients. Looking back, this is why the Lambda warming worked as well as it did. Without these retries, this approach would have been far more problematic.
Then we started using OData for our internal services and its client did not have automatic retries. We quickly felt the pain of the blocking warming calls.
So we added Polly to the architecture to provide more resilience to our clients. We configured retries with exponential back-off, so our service calls would be more persistent. We applied this policy to our REST and OData clients as well as any other HttpClients in use.
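To make the retry idea concrete, here is a minimal, language-neutral sketch in Python. It is not Polly's API and not our production code; the function name `retry_with_backoff`, the retry count, and the base delay are all assumptions chosen for illustration.

```python
import time


def retry_with_backoff(call, max_retries=3, base_delay=0.2, sleep=time.sleep):
    """Invoke call(); on failure wait base_delay * 2**attempt, then retry.

    Hypothetical helper for illustration only. The defaults are assumptions,
    not values taken from the post.
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            # Out of retries: surface the last error to the caller.
            if attempt == max_retries:
                raise
            # Exponential back-off: 0.2 s, 0.4 s, 0.8 s, ...
            sleep(base_delay * (2 ** attempt))
```

Passing the `sleep` function in as a parameter keeps the sketch easy to exercise without real delays. In practice you would probably also limit which exceptions trigger a retry, so that non-transient errors fail fast.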
Timeouts
Even with the retries, when all Lambda instances were busy and another concurrent request arrived, a new Lambda instance started (a cold start). That request waited for the cold start to complete before it received a response, or it timed out. This again created a poor user experience.
The issue was that we hadn’t changed the default timeout of 30 seconds. Some Lambdas can take more than 30 seconds to cold start. Most of our service calls, however, took 1 second or less to respond on average with a warm Lambda. The solution was to decrease our timeout to a value slightly higher than the average execution time for the Lambda. So if the average was 500 ms, we would time out at 1 second. That way, if a call took longer than the average warm Lambda, it would time out and then try again.
This process helped tremendously. A cold start would still produce a slower response, but it was far more tolerable at 1 second longer than normal. The likelihood of a retry hitting a second cold start was much lower. In most cases, the first retry would hit one of the warm Lambdas that had by then completed its prior request.
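The timeout rule can be sketched as follows. This is an illustrative Python example, not Polly or HttpClient code. The names `timeout_from_average` and `call_with_short_timeout` are hypothetical, the factor of 2 simply mirrors the 500 ms to 1 second example, and `call` is assumed to be any function that accepts a timeout in seconds and raises `TimeoutError` when it is exceeded.

```python
def timeout_from_average(avg_seconds, factor=2.0):
    """Pick a per-attempt timeout slightly above the warm average.

    The factor of 2 is an assumption that matches the 500 ms -> 1 s example.
    """
    return avg_seconds * factor


def call_with_short_timeout(call, avg_seconds, max_retries=2):
    """Call a service with a short timeout and retry when it times out.

    `call` is a hypothetical function that takes a timeout in seconds and
    raises TimeoutError if the service does not answer in time.
    """
    timeout = timeout_from_average(avg_seconds)
    for attempt in range(max_retries + 1):
        try:
            return call(timeout)
        except TimeoutError:
            # A slow answer most likely means a cold start. Give up on it
            # and retry, hoping to land on an instance that is already warm.
            if attempt == max_retries:
                raise
```

For brevity this sketch retries immediately. A real policy would combine the short timeout with the back-off described earlier.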
Other Solutions
Last year AWS announced Provisioned Concurrency, which pre-provisions your Lambdas so they are immediately ready to serve requests (no cold start). If you have a consistent expected usage, this may be the perfect solution for you. It can be costly, however. If your workload is constant and does not fluctuate much or often, then Lambda may not be a good execution model for you. Lambda can handle such a workload, but at a higher cost than the alternatives.
Hopefully this post helps you think through how to overcome cold start issues. Have another solution or question? Leave a comment below to continue the conversation.
Happy Serverless Coding!
Hi Robb,
Wondering if you have considered using AppSync. This is AWS’ GraphQL implementation. I would think this would give you the best user interface experience. Another option could be to use API Gateway caching.
We have been using Lambda with .Net Core and Node behind API Gateway for a few years now and haven’t really found issues with response time. We have very high traffic and never really suffered from cold start problems but I think the reserved capacity and hyperplane ENIs have mostly taken care of that problem.
I just listened to your conversation with the .Net Rocks guys and left a comment over there.
Great to hear of another environment so similar to ours. Would love to swap war stories if you are interested.
Thanks,
Pedro.
Thanks for sharing Pedro!
Are you using Lambda to host your .NET Core code? We found the cold start times in Lambda can cause delays in response time.
Hi Robb,
We have never had cold start problems in Production. In Dev we did because they are not hit as much as Prod but the reserved capacity seems to have solved that.
Our Prod environment is extremely busy. We have about 370k daily users just in our mobile application.
On a separate question. What are you using for authentication? Are you using Cognito or something else?
Thanks.
Our load is very elastic, not consistently high. So we see the cold starts a lot.
We use SQL auth but were making plans to move to Cognito.
Actually changed jobs last week and will be working in Azure now.