
I have an existing table with all the data, around 1 million records.

I needed to add a GSI to the table to better query some data. However, with the current app logic the GSI keys are only being populated for new records. The partition key and sort key for the GSI can be derived from other attributes of the record.

What is the most efficient/recommended way to backfill the GSI PK and SK for the existing records in the table?

Ajay Narain Mathur

1 Answer


How you approach backfilling the table ultimately depends on the data size.

Small table

If your data size is in the low GBs, then you can achieve this quite easily using Lambda, EC2, your local machine, etc.

You would Scan or Parallel Scan all the items in the table, filter out the items which do not require updating, and then call UpdateItem on each remaining item to append the GSI keys.
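
A minimal sketch of that loop with boto3; the table name and the attribute names (pk, sk, status, order_date, gsi_pk, gsi_sk) are illustrative assumptions, not something from your schema:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")  # hypothetical table name

# Only fetch items that are still missing the new GSI partition key.
scan_kwargs = {"FilterExpression": "attribute_not_exists(gsi_pk)"}

while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        # Derive the GSI keys from existing attributes (names assumed).
        table.update_item(
            Key={"pk": item["pk"], "sk": item["sk"]},
            UpdateExpression="SET gsi_pk = :gpk, gsi_sk = :gsk",
            ExpressionAttributeValues={
                ":gpk": item["status"],
                ":gsk": item["order_date"],
            },
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```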

Large table

If, however, you have a large amount of data, you may want to use a distributed system such as AWS Glue with Spark. Here, you would either read all the items directly from the table or read from an S3 export, and again identify the items which require updating. You would then use Spark's foreachPartition to distribute the UpdateItem calls across executors.
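
Roughly, the Spark side could look like the sketch below. The S3 path, table name, and attribute names are assumptions, and it assumes the exported items have already been flattened into plain columns (a real DynamoDB S3 export is in DynamoDB JSON, which you would unwrap first):

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical S3 location of the exported (and already flattened) items.
df = spark.read.json("s3://my-bucket/table-export-flattened/")
missing = df.filter(df["gsi_pk"].isNull())

def update_partition(rows):
    # One DynamoDB client per executor partition.
    table = boto3.resource("dynamodb").Table("my-table")
    for row in rows:
        table.update_item(
            Key={"pk": row["pk"], "sk": row["sk"]},
            UpdateExpression="SET gsi_pk = :gpk, gsi_sk = :gsk",
            ExpressionAttributeValues={
                ":gpk": row["status"],
                ":gsk": row["order_date"],
            },
        )

missing.rdd.foreachPartition(update_partition)
```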

Performance vs Cost

As with most services, there is a performance vs cost trade-off. If you want to go fast, you'll likely incur more cost. If cost is an important factor for you, then you may want to do things slower. That would entail using a rate-limited Scan rather than an unthrottled Scan/Parallel Scan or the Glue approach.

Rate limiting can be achieved using the Guava library's RateLimiter in Java, for example. This ensures you don't consume too much capacity too quickly.
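
Guava is Java-specific; in Python you can get a similar effect by pacing the Scan with the capacity each page reports it consumed. A rough sketch, where the table name and the 100 RCU/s budget are arbitrary examples:

```python
import time
import boto3

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical name
target_rcu_per_sec = 100  # example budget, tune for your table

scan_kwargs = {"ReturnConsumedCapacity": "TOTAL", "Limit": 500}
while True:
    page = table.scan(**scan_kwargs)
    consumed = page["ConsumedCapacity"]["CapacityUnits"]
    # Sleep long enough that, on average, we stay under the read budget.
    time.sleep(consumed / target_rcu_per_sec)
    # ... call UpdateItem on page["Items"] here, paced in the same way ...
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```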

For larger tables it's a little more difficult, as long-running processes can die for various reasons. For that reason, I would continually checkpoint how far you have read, to avoid duplicate work when you restart your process.
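
One simple way to checkpoint is to persist the LastEvaluatedKey after each processed page so a restarted run resumes from there. A sketch; the local checkpoint file is just an example (S3 or a small DynamoDB item works the same way), and it assumes string key attributes so they serialize to JSON directly:

```python
import json
import os
import boto3

CHECKPOINT_FILE = "scan_checkpoint.json"  # example location
table = boto3.resource("dynamodb").Table("my-table")

scan_kwargs = {}
if os.path.exists(CHECKPOINT_FILE):
    # Resume from where the previous run stopped.
    with open(CHECKPOINT_FILE) as f:
        scan_kwargs["ExclusiveStartKey"] = json.load(f)

while True:
    page = table.scan(**scan_kwargs)
    # ... update the items in this page first ...
    last_key = page.get("LastEvaluatedKey")
    if not last_key:
        break
    # Checkpoint only after the page has been fully processed.
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(last_key, f)
    scan_kwargs["ExclusiveStartKey"] = last_key
```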

Another important cost consideration is to use provisioned capacity mode when doing any sort of backfilling or data loading. On-demand mode is pay-per-request, so it does not matter how fast or slow you go: you pay for each item read and written. Provisioned capacity mode offers significant cost savings over the alternative.
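
For example, you could switch the table to provisioned capacity before the backfill and back afterwards. The capacity numbers below are placeholders, and keep in mind DynamoDB only lets you switch billing mode once every 24 hours:

```python
import boto3

client = boto3.client("dynamodb")

# Before the backfill: switch to provisioned throughput (placeholder values).
client.update_table(
    TableName="my-table",
    BillingMode="PROVISIONED",
    ProvisionedThroughput={"ReadCapacityUnits": 1000, "WriteCapacityUnits": 1000},
    # If the table has GSIs, set their throughput too via GlobalSecondaryIndexUpdates.
)

# After the backfill: return to pay-per-request if that suits normal traffic.
client.update_table(TableName="my-table", BillingMode="PAY_PER_REQUEST")
```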

Leeroy Hannigan
  • I think a word of warning concerning Read/Write Throughput/Cost for these "migrations" may be in order, otherwise an excellent answer. – Maurice May 24 '23 at 09:52
  • Good point Maurice, I'll add a cost component. – Leeroy Hannigan May 24 '23 at 11:04
  • Thanks @LeeHannigan, I will give the large table section a shot. – Ajay Narain Mathur May 24 '23 at 11:28
  • I actually just did something similar to populate a new TTL attribute on 11M records using EMR (Spark and Hadoop). This article was my guide https://aws.amazon.com/blogs/database/backfilling-an-amazon-dynamodb-time-to-live-ttl-attribute-with-amazon-emr/ – Charles May 24 '23 at 13:08
  • You have to be careful here, as the guide you shared is an overwrite operation, and you could corrupt data easily. I suggested approaches which use UpdateItem to avoid data corruption. – Leeroy Hannigan May 24 '23 at 13:13