Serverless - DynamoDB (terrible) performances compared to RethinkDB + AWS Lambda

Question

In the process of migrating an existing Node.js (Hapi.js) + RethinkDB from an OVH VPS (smallest vps) to AWS Lambda (node) + DynamoDB, I've recently come across a very huge performance issue.

The usage is rather simple, people use an online tool, and "stuff" gets saved in the DB, passing through a node.js server/lambda. That "stuff" takes some spaces, around 3kb non-gzipped (a complex object with lots of keys and children, hence why using a NOSQL solution makes sense)

There is no issue with the saving itself (for now...), not so many people use the tool and there isn't much simultaneous writing to do, which makes sense to use a Lambda instead of a 24/7 running VPS.

The real issue is when I want to download those results.

Using Node+RethinkDB takes about 3sec to scan the whole table and generate a CSV file to download
AWS Lambda + DynamoDB timeout after 30sec, even if I paginate the results to download only 1000 items, it still takes 20 sec (no timeout this time, just very slow) -> There are 2200 items on that table, and we could deduce that we'd need around 45sec to download the whole table, if AWS Lambda wouldn't timeout after 30sec

So, the operation takes around 3s with RethinkDB, and would theoretically take 45sec with DynamoDB, for the same amount of fetched data.

Let's look at those data now. There are 2200 items in the table, for a total of 5MB, here are the DynamoDB stats:

Provisioned read capacity units 29 (Auto Scaling Enabled)
Provisioned write capacity units    25 (Auto Scaling Enabled)
Last decrease time  October 24, 2018 at 4:34:34 AM UTC+2
UTC: October 24, 2018 at 2:34:34 AM UTC

Local: October 24, 2018 at 4:34:34 AM UTC+2

Region (Ireland): October 24, 2018 at 2:34:34 AM UTC

Last increase time  October 24, 2018 at 12:22:07 PM UTC+2
UTC: October 24, 2018 at 10:22:07 AM UTC

Local: October 24, 2018 at 12:22:07 PM UTC+2

Region (Ireland): October 24, 2018 at 10:22:07 AM UTC

Storage size (in bytes) 5.05 MB
Item count  2,195

There is 5 provisioned read/write capacity units, with an autoscaling max to 300. But the autoscaling doesn't seem to scale as I'd expect, went from 5 to 29, could use 300 which would be enough to download 5MB in 30 sec, but doesn't use them (I'm just getting started with autoscaling so I guess it's misconfigured?)

Here we can see the effect of the autoscaling, which does increase the amount of read capacity units, but it does so too late and the timeout has happened already. I've tried to download the data several times in a row and didn't really see much improvements, even with 29 units.

The Lambda itself is configured with 128MB RAM, increasing to 1024MB has no effect (as I'd expect, it confirms the issue comes from DynamoDB scan duration)

So, all this makes me wonder why DynamoDB can't do in 30sec what does RethinkDB in 3sec, it's not related to any kind of indexing since the operation is a "scan", therefore must go through all items in the DB in any order.

I wonder how am I supposed to fetch that HUGE dataset (5MB!) with DynamoDB to generate a CSV.

And I really wonder if DynamoDB is the right tool for the job, I really wasn't expecting so low performances compared to what I've been using by the past (mongo, rethink, postgre, etc.)

I guess it all comes down to proper configuration (and there probably are many things to improve there), but even so, why is it such a pain to download a bunch of data? 5MB is not a big deal but there if feels like it requires a lot of efforts and attention, while it's just a common operation to export a single table (stats, dump for backup, etc.)

Edit: Since I created this question, I read https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b which explains in-depth the issue I've met. Basically, autoscaling is slow to trigger, which explains why it doesn't scale right with my use case. This article is a must-read if you want to understand how DynamoDB auto-scaling works.

What happens if you manually change the DynamoDB table throughput to 300 before running the Lambda function, instead of relying on auto-scaling? I think auto-scaling will happen too slow to help your single Lambda function invocation. Also, if your Lambda function is timing out at 30 seconds then you must have it configured that way. You need to increase the timeout value for the function. Also, increasing the available RAM in the Lambda function settings also increases the available CPU, so you might try bumping that up to the highest setting. — Mark B, Oct 24 '18 at 15:09
30 sec is the max allowed for a Lambda through APIGateway so I can't increase it, I tried to auto-scale it first, thinking it would handle the activity peak, and I didn't want to pay for 300 RCU when I barely use them most of the time but need them once in a while — Vadorequest, Oct 25 '18 at 09:43
Ah you didn't mention you were using API Gateway also. This sounds like more of a back-end process that could run on a schedule instead of being triggered by API Gateway. — Mark B, Oct 25 '18 at 12:27
yeah, didn't think to bring this on. The point being to download a file when someone needs it it made sense to do it that way, it wasn't an issue when it was taking a few sec with RethinkDB :) — Vadorequest, Oct 25 '18 at 18:32

score 1 · Answer 1 · answered Oct 24 '18 at 10:57

1

DynamoDB is not designed for that kind of usage. Its not like a traditional DB that you can just query as you wish and it especially does not do well with large datasets at a time such as the one you are requesting.

For this type of scenario, I actually use DyanamoDB streams to create a projection into an S3 bucket and then do large exports in that way. It will probably even be faster than the RethinkDB export you reference.

In short, DynamoDb is best as a transactional key-value store for known queries.

answered Oct 24 '18 at 10:57

Gareth McCumskey

1,510
7
12

I'd be interested in a more in-depth explanation, if you know of any good tutorial. :) Also, I'd like to know what you mean by "large datasets". 5MB isn't a "large dataset" at all, for me. – Vadorequest Oct 24 '18 at 12:24
For anyone interested by a more in-depth explanation, we talked about it extensively on Slack https://serverless-contrib.slack.com/archives/CA4QT5VU3/p1540378231000100 – Vadorequest Oct 25 '18 at 09:52

score 1 · Answer 2 · answered Oct 26 '18 at 08:57

I have come across exactly the same problem in my application (i.e. DynamoDB autoscaling does not kick in fast enough for an on-demand high intensity job).

I was pretty committed to DynamoDB by the time I can across the problem, so I worked around it. Here is what I did.

When I'm about to start a high-intensity job, I programatically increase the RCU and WCU on my DynamoDB table. In your case you could probably have one lambda to increase the throughput, then have that lambda kick off another one to do the high-intensity job. Note that increasing provision can take a few seconds, hence splitting this into a separate lambda is probably a good idea.

I will paste my personal notes on the problem I faced below. Apologies but I can't be bothered to format them into stackoverflow markup.

We want enough throughput provisioned all the time so that users have a fast experience, and even more importantly, don't get any failed operations. However, we only want to provision enough throughput to serve our needs, as it costs us money.

For the most part we can use Autoscaling on our tables, which should adapt our provisioned throughput to the amount actually being consumed (i.e. more users = more throughput automatically provisioned). This fails in two key aspects for us:

Autoscaling only increases throughput about 10 minutes after the throughput provision threshold is breached. When it does start scaling up, it is not very aggressive in doing so. There is a great blog on this here https://hackernoon.com/the-problems-with-dynamodb-auto-scaling-and-how-it-might-be-improved-a92029c8c10b. When there is literally zero consumption of throughput, DynamoDB does not decrease throughput. AWS Dynamo not auto-scaling back down The place we really need to manage throughput is on the Invoice table WCUs. RCUs are a lot cheaper than WCUs, so reads are less of a worry to provision. For most tables, provisioning a few RCU and WCU should be plenty. However, when we do an extract from the source, our write capacity on the Invoices table is high for a 30 minute period.

Lets imagine we just relied on Autoscaling. When a user kicked off an extract, we would have 5 minutes of burst capacity, which may or may not be enough throughput. Autoscaling would kick in after around 10 minutes (at best), but it would do so ponderously - not scaling up as fast we needed. Our provision would not be high enough, we would get throttled, and we would fail to get the data we wanted. If several processes were running concurrently, this problem would be even worse - we just couldn't handle multiple extracts at the same time.

Fortunately we know when we are about to beast the Invoices table, so we can programatically increase throughput on the Invoices table. Increasing throughput programatically seems to take effect very quickly. Probably within seconds. I noticed in testing that the Metrics view in DynamoDB is pretty useless. Its really slow to update and I think sometimes it just showed the wrong information. You can use AWS CLI to describe the table, and see what the throughput is provisioned at in real-time:

aws dynamodb describe-table --table-name DEV_Invoices

In theory we could just increase throughput when an extract started, and then reduce it again when we were finished. However, whilst you can increase throughput provision as often as you like, you can only decrease throughput provision 4 times in a day, although you can then decrease throughput once every hour (i.e. 27 times in 24 hours). https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html#default-limits-throughput. This approach is not going to work, as our decrease in provision might well fail.

Even if Autoscaling is in play, it still has to abide by the provisioning decrease rules. So if we've decreased 4 times, Autoscaling will have to wait an hour before decreasing again - and thats for both read and write values

Increasing throughput provision programatically is a good idea, we can do it fast (much faster than Autoscaling), so it works for our infrequent high workloads. We can't decrease throughput programtically after an extract (see above) but there are a couple of other options.

Autoscaling for throughput decrease

Note that even when Autoscaling is set, we can programtically change it to anything we like (e.g. higher than the maximum Autoscaling level).

We can just rely on Autoscaling to bring the capacity back down an hour or two after the extract has finished, that's not going to cost us too much.

There is another problem though. If our consumed capacity drops right down to zero after an extract, which is likely, no consumption data is sent to CloudWatch and Autoscaling doesn't do anything to reduce provisioned capacity, leaving us stuck on a high capacity.

There are two fudge options to fix this though. Firstly we can set the minimum and maximum throughput provision to be same the same value. So for example setting the minimum and maximum provisioned RCUs within Autoscaling to 20 will ensure that the provisioned capacity returns to 20, even if there is zero consumed capacity. Im not sure why but this works (i've tested it, and it does), AWS acknowledge the workaround here:

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AutoScaling.html

The other option is to create a Lambda function to attempt to execute a (failed) read and delete operation on the table every minute. Failed operations still consume capacity which is why this works. This job ensures data is sent to CloudWatch regularly, even when our 'real' consumption is zero, and therefore Autoscaling will reduce capacity correctly.

Note that read and write data is sent separately to CloudWatch. So if we want WCUs to decrease when real consumed WCUs are zero, we need to use a write operation (i.e. a delete). Similarly we need a read operation to make sure RCUs are updated. Note that failed Reads (if the item does not exist) and failed Deletes (if the item does not exist) but still consume throughput.

Lambda for throughput decrease

In the previous solution we used a Lambda function to continously 'poll' the table, thus creating the CloudWatch data which enables the DynamoDB Autoscaling to function. As an alternative we could just have a lambda which runs regularly and scales down the throughput when required. When you 'describe' a DynamoDB table, you get the current Provisioned Throughput as well as the last increase datetime and last decrease datetime. So the lambda could say: if the provisioned WCUs are over a threshold and the last time we had a throughput increase was more than half an hour ago (i.e. we are not in the middle of an extract), lets decrease the throughput right down.

Given that this is more code than the Autoscaling option, im not inclined to do this one.

Thanks for the very interesting feedback. Since I asked the question I also tried something similar, basically increase the RCU to 300 programatically, but I didn't know there was a delay before it was taken into account and I didn't notice any improvement. Also, I wasn't aware of the limitation about decreasing RCU through auto-scaling, didn't know it had limitations such as a number of time it can proc per hour/day, I was counting on auto-scaling too to decrease RCU and go back to normal state when they're not needed anymore. — Vadorequest, Oct 29 '18 at 13:16
When you have a DynamoDB table handle you can call a function called something like 'wait', which waits for the table to become active again after updating it's provision. If you don't call wait you may start using the table before the provision update has had time to take effect. — F_SO_K, Oct 30 '18 at 09:01
Thanks! In order to test this out I manually increased WCU to 300 (in order to simulate a dynamic increase) then called my endpoint. Well, it didn't work any better, the lambda still timed out despite having a much higher read capacity unit. It used about 17 WCU when 300 were available and couldn't reply in time. https://drive.google.com/open?id=1lsaHRfF8_ZOEez6x2u5Inf-0PjuGOgxZ — Vadorequest, Oct 30 '18 at 11:30
Looks like your process is not bound by WCU. Did you increase RCU as well? Not quite sure what your process is doing but if your function requires a read followed by a write, you could be RCU bound. — F_SO_K, Oct 30 '18 at 11:40
The lambda performed a scan operation which doesn't write anything but just returns the results, I also had increase RCU to 300 as well (and damn that's costly, changed settings back to normal after 20-30mn trying, 300 RCU/WCU is about $180/month, per table) --EDIT-- - I mentioned WCU in my other comment but meant RCU, anyway as explained I set both to 300 without improvement — Vadorequest, Oct 30 '18 at 20:03
I know it was only released a few days ago, but anyone attempted the same type of load testing with "on-demand" instances? I asked the PM during ReInvent on Twitch and he pretty much dismissed the question, as if on-demand will 'autmomagically' scale to any demand with no increased latency, but I'd imagine it uses the same (or similar) underlying system that handles auto-scaling for provisioned. We need to do low latency batch jobs on Dynamo and this is pretty concerning. — cantuket, Dec 02 '18 at 22:00
I hadn't seen this. Really interesting. The example they give shows on-demand scale-up happening MUCH faster than current provisioning. — F_SO_K, Dec 02 '18 at 22:15

Serverless - DynamoDB (terrible) performances compared to RethinkDB + AWS Lambda

2 Answers2

Linked