How to query all data from DynamoDB by multiple clients

Question

I have a DynamoDB table with 10 million records. Every 24 hours I need to perform some calculations on the primary key of each record, which means reading the entire table every 24 hours. Because a single DynamoDB query returns at most 1 MB of data, reading the table sequentially and doing the calculation would take more than 24 hours in total, so I would like 10 workers to share the querying and the calculation. How should I query the table so that each record is retrieved by exactly one worker, and all 10 million records are eventually retrieved?

It seems like I need to save the LastEvaluatedKey somewhere so that worker 2 knows where the worker 1 query ends in order to keep querying the table.

score 2 · Answer 1 · answered May 25 '21 at 13:32

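The DynamoDB `Scan` operation supports parallel scans: each worker passes its own `Segment` number together with a shared `TotalSegments` value, and each item in the table belongs to exactly one segment.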
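As a rough illustration of the parallel scan, here is a minimal Python sketch. It assumes a boto3-style low-level client whose `scan` method accepts `TableName`, `Segment`, `TotalSegments` and `ExclusiveStartKey`. The helper names (`scan_segment`, `process_segment`, `parallel_scan`) and the `handle_item` callback are hypothetical and stand in for your own calculation. Each worker follows `LastEvaluatedKey` only within its own segment, so nothing has to be shared between workers.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(client, table_name, segment, total_segments):
    """Yield every item of one segment, following pagination within it."""
    kwargs = {
        "TableName": table_name,
        "Segment": segment,
        "TotalSegments": total_segments,
    }
    while True:
        page = client.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break  # this segment is exhausted
        # Resume this segment where the previous page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def process_segment(client, table_name, segment, total_segments, handle_item):
    """Run the (hypothetical) per-item calculation over one segment."""
    count = 0
    for item in scan_segment(client, table_name, segment, total_segments):
        handle_item(item)
        count += 1
    return count


def parallel_scan(client, table_name, total_segments, handle_item):
    """Scan all segments concurrently; return the number of items seen."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            pool.submit(process_segment, client, table_name,
                        segment, total_segments, handle_item)
            for segment in range(total_segments)
        ]
        return sum(f.result() for f in futures)


# Assumed usage against a real table (needs boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("dynamodb")
#   parallel_scan(client, "my-table", 10, my_calculation)
```

Threads are used here only to keep the sketch short. The same `scan_segment` function could just as well run in 10 separate processes or machines, each given a different `segment` number.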

However, the recommended way to handle aggregations in DynamoDB is to use a Lambda function with DynamoDB Streams and maintain your aggregates in the existing table or in a new one.
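To make the streams approach concrete, here is a minimal Python sketch of a Lambda handler that keeps a running item count. Everything specific in it is an assumption for illustration: the `aggregates` table, its `pk` key attribute, the `item_total` attribute, and the helper names. A count is used only because the question does not say what the real calculation is. It relies on the `eventName` field (`INSERT`, `MODIFY`, `REMOVE`) of stream records and on `update_item` with an `ADD` update expression.

```python
def count_delta(event):
    """Net change in item count represented by one stream batch."""
    delta = 0
    for record in event.get("Records", []):
        name = record.get("eventName")
        if name == "INSERT":
            delta += 1
        elif name == "REMOVE":
            delta -= 1
        # MODIFY does not change the number of items.
    return delta


def apply_delta(client, delta, table_name="aggregates"):
    """Atomically add delta to a counter item (hypothetical table and keys)."""
    if delta == 0:
        return  # nothing to write
    client.update_item(
        TableName=table_name,
        Key={"pk": {"S": "item_count"}},
        UpdateExpression="ADD item_total :d",
        ExpressionAttributeValues={":d": {"N": str(delta)}},
    )


def make_handler(client):
    """Build a Lambda-style handler bound to a DynamoDB client."""
    def handler(event, context):
        delta = count_delta(event)
        # Note: a retried batch would be counted twice; this sketch
        # does not attempt to de-duplicate.
        apply_delta(client, delta)
        return {"delta": delta}
    return handler


# Assumed wiring inside a real Lambda module (needs boto3):
#   import boto3
#   handler = make_handler(boto3.client("dynamodb"))
```

With something like this in place, the aggregate is updated as writes happen, so the 24-hour full-table pass may no longer be needed.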

Optionally you could use Redshift or Hive as mentioned in the answer to this question: How to do basic aggregation with DynamoDB?

Charles

Using parallel scan with multi-threading helps to process records in reasonable time. Even 10 million of them. All of the AWS SDKs should make this pretty easy to implement without the need to handle `LastEvaluatedKey` yourself. I like your "aggregation using streams" option. – Jens May 25 '21 at 22:04
