
I have a DynamoDB table with 10 million records. Every 24 hours I need to run a calculation on the primary key of each record, which means reading the entire table once a day. A single DynamoDB request returns at most 1 MB of data, so reading and processing the table sequentially would take more than 24 hours. I would therefore like 10 workers to share the reads and the calculation. How should I query the table so that each record is retrieved by exactly one worker and all 10 million records are eventually retrieved?

It seems I would need to save the `LastEvaluatedKey` somewhere so that worker 2 knows where worker 1's query ended and can continue from there.

SmartFingers

1 Answer


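DynamoDB's `Scan` operation supports parallel scans: each worker passes its own `Segment` number together with a shared `TotalSegments` value, and DynamoDB returns a disjoint portion of the table to each worker. Each worker only has to track the `LastEvaluatedKey` of its own segment, so nothing needs to be shared between workers.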
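
As a rough illustration of the parallel scan, here is a minimal Python sketch. It assumes boto3-style `Table` objects (anything with a `scan(Segment=..., TotalSegments=..., ExclusiveStartKey=...)` method); the names `scan_segment`, `parallel_scan`, `table_factory`, and `handle_item` are made up for this example, and threads stand in for whatever worker mechanism you actually use.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_segment(table, segment, total_segments, handle_item):
    """Scan one segment to completion and return the number of items seen."""
    kwargs = {"Segment": segment, "TotalSegments": total_segments}
    count = 0
    while True:
        page = table.scan(**kwargs)  # each call returns at most one page
        for item in page.get("Items", []):
            handle_item(item)
            count += 1
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return count  # no more pages in this segment
        # Resume this segment (and only this segment) after the last key read.
        kwargs["ExclusiveStartKey"] = last_key


def parallel_scan(table_factory, total_segments, handle_item):
    """Run one worker thread per segment; handle_item must be thread-safe."""
    with ThreadPoolExecutor(max_workers=total_segments) as pool:
        futures = [
            # table_factory() gives every worker its own table object,
            # since boto3 resource objects are not meant to be shared
            # across threads.
            pool.submit(scan_segment, table_factory(), seg,
                        total_segments, handle_item)
            for seg in range(total_segments)
        ]
        return sum(f.result() for f in futures)
```

With real AWS access, a call might look like `parallel_scan(lambda: boto3.resource("dynamodb").Table("my-table"), 10, process)`, where `my-table` and `process` are placeholders. If the 10 workers are separate processes or machines instead of threads, each one can call `scan_segment` directly with its own index from 0 to 9 and `total_segments=10`.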

However, the recommended way to handle aggregations in DynamoDB is to attach a Lambda function to DynamoDB Streams and maintain the aggregates incrementally, either in the existing table or in a new one.
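
A hedged sketch of what such a stream-triggered Lambda could look like in Python follows. Everything specific here is an assumption for illustration: the numeric attribute `amount`, the `aggregates` table, its `pk` key with the value `totals`, and a stream configured with the `NEW_AND_OLD_IMAGES` view type so both images are present in each record.

```python
from decimal import Decimal


def amount_of(image):
    """Read the hypothetical numeric 'amount' attribute from a stream image."""
    if not image or "amount" not in image:
        return Decimal(0)
    return Decimal(image["amount"]["N"])  # stream images use typed values


def compute_delta(event):
    """Return (count_delta, sum_delta) for one batch of stream records."""
    count_delta = 0
    sum_delta = Decimal(0)
    for record in event.get("Records", []):
        data = record["dynamodb"]
        if record["eventName"] == "INSERT":
            count_delta += 1
        elif record["eventName"] == "REMOVE":
            count_delta -= 1
        # INSERT has no OldImage, REMOVE has no NewImage, MODIFY has both.
        sum_delta += amount_of(data.get("NewImage")) - amount_of(data.get("OldImage"))
    return count_delta, sum_delta


def handler(event, context):
    """Lambda entry point: apply the batch delta to a running total."""
    import boto3  # imported here so compute_delta can be tested without AWS

    count_delta, sum_delta = compute_delta(event)
    table = boto3.resource("dynamodb").Table("aggregates")  # assumed table name
    table.update_item(
        Key={"pk": "totals"},  # assumed key schema
        UpdateExpression="ADD item_count :c, amount_sum :s",
        ExpressionAttributeValues={":c": count_delta, ":s": sum_delta},
    )
    return {"count_delta": count_delta, "sum_delta": str(sum_delta)}
```

Because Lambda can retry a failed batch, a production version would likely need some form of deduplication to avoid double counting; that is left out of this sketch.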

Alternatively, you could use Redshift or Hive, as mentioned in the answers to this question: How to do basic aggregation with DynamoDB?

Charles
  • Using parallel scan with multi-threading helps to process records in reasonable time. Even 10 million of them. All of the AWS SDKs should make this pretty easy to implement without the need to handle `LastEvaluatedKey` yourself. I like your "aggregation using streams" option. – Jens May 25 '21 at 22:04