How you would approach backfilling the table would ultimately depend on the data size.
Small table
If your data size is in the low GBs, then you can achieve it quite easily using Lambda, EC2, your local machine, etc.
You would need to Scan or Parallel Scan all the items in the table, filter out the items which do not require updating, and then call UpdateItem on each item in the result set to append the GSI keys.
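As a rough sketch of that loop, assuming the AWS SDK for Java v2, a table named my-table with a single string partition key pk, and a new GSI attribute gsi1pk derived from an existing attribute someAttribute (all of these names are placeholders):

```java
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;
import software.amazon.awssdk.services.dynamodb.model.ScanResponse;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

public class GsiBackfill {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();
        Map<String, AttributeValue> startKey = null;

        do {
            // Scan one page; for a parallel scan, add .segment(n) and .totalSegments(N)
            ScanRequest.Builder scan = ScanRequest.builder().tableName("my-table");
            if (startKey != null) {
                scan.exclusiveStartKey(startKey);
            }
            ScanResponse page = ddb.scan(scan.build());

            for (Map<String, AttributeValue> item : page.items()) {
                // Skip items that already carry the new GSI key
                if (item.containsKey("gsi1pk")) {
                    continue;
                }
                // Placeholder: derive the GSI key value from an existing attribute
                AttributeValue gsi1pk = item.get("someAttribute");
                if (gsi1pk == null) {
                    continue;
                }

                ddb.updateItem(UpdateItemRequest.builder()
                        .tableName("my-table")
                        .key(Map.of("pk", item.get("pk")))
                        .updateExpression("SET gsi1pk = :g")
                        .expressionAttributeValues(Map.of(":g", gsi1pk))
                        .build());
            }

            // LastEvaluatedKey is only present while there are more pages to read
            startKey = page.hasLastEvaluatedKey() ? page.lastEvaluatedKey() : null;
        } while (startKey != null);
    }
}
```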
Large table
If, however, you have a large amount of data, you may want to use a distributed system such as AWS Glue with Spark. Here, you would either read all the items directly from the table or read them from an S3 export, and again obtain the keys which require updating. Then use Spark's foreachPartition to distribute the UpdateItem calls across executors.
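A minimal sketch of the Spark side, assuming the table has already been exported to S3 in DynamoDB JSON format and reusing the same placeholder names as above; a Glue job written in Scala or Python would follow the same foreachPartition pattern:

```java
import java.util.Map;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

public class GsiBackfillSparkJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("gsi-backfill").getOrCreate();

        // Read the S3 export (DynamoDB JSON, one item per line, "Item" wrapper) and
        // pull out the key plus the attribute the new GSI key is derived from.
        Dataset<Row> items = spark.read()
                .json("s3://my-bucket/exports/my-table/data/")  // placeholder path
                .selectExpr("Item.pk.S as pk", "Item.someAttribute.S as someAttribute");
                // items that already have the GSI key could be filtered out here

        JavaRDD<Row> rows = items.toJavaRDD();

        // Each partition opens its own client and issues UpdateItem calls for its slice of keys.
        rows.foreachPartition(partition -> {
            DynamoDbClient ddb = DynamoDbClient.create();
            while (partition.hasNext()) {
                Row row = partition.next();
                String pk = row.getAs("pk");
                String gsi1pk = row.getAs("someAttribute");
                ddb.updateItem(UpdateItemRequest.builder()
                        .tableName("my-table")
                        .key(Map.of("pk", AttributeValue.builder().s(pk).build()))
                        .updateExpression("SET gsi1pk = :g")
                        .expressionAttributeValues(Map.of(":g", AttributeValue.builder().s(gsi1pk).build()))
                        .build());
            }
            ddb.close();
        });

        spark.stop();
    }
}
```

Creating the DynamoDB client inside foreachPartition keeps it off the driver and avoids serialization issues; each executor then makes its own UpdateItem calls for the partitions it owns.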
Performance vs Cost
As with most services, there is a performance vs cost trade-off. If you want to go fast, you'll likely incur more cost. If cost is an important factor for you, then you may want to do things slower. That would entail using a rate-limited Scan rather than an unthrottled Scan/Parallel Scan or the Glue approach.
Rate limiting can be achieved using the Guava library in Java, for example. This ensures you don't consume too much capacity too quickly.
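For example, a scan loop throttled with Guava's RateLimiter, where the loop "pays for" the read capacity each page consumed before fetching the next page (the rate, page size, and table name are placeholders you would size to your own table):

```java
import java.util.Map;

import com.google.common.util.concurrent.RateLimiter;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ReturnConsumedCapacity;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;
import software.amazon.awssdk.services.dynamodb.model.ScanResponse;

public class RateLimitedScan {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();
        // Target read rate: 100 RCU per second (placeholder; size to your provisioned capacity)
        RateLimiter readLimiter = RateLimiter.create(100.0);
        Map<String, AttributeValue> startKey = null;

        do {
            ScanRequest.Builder scan = ScanRequest.builder()
                    .tableName("my-table")
                    .returnConsumedCapacity(ReturnConsumedCapacity.TOTAL)
                    .limit(200); // keep pages small so throttling stays smooth
            if (startKey != null) {
                scan.exclusiveStartKey(startKey);
            }
            ScanResponse page = ddb.scan(scan.build());

            // ... update the page's items here, ideally behind a second RateLimiter
            // sized to your write capacity ...

            // Block until enough permits are available to cover the capacity just consumed
            double consumed = page.consumedCapacity().capacityUnits();
            readLimiter.acquire(Math.max(1, (int) Math.ceil(consumed)));

            startKey = page.hasLastEvaluatedKey() ? page.lastEvaluatedKey() : null;
        } while (startKey != null);
    }
}
```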
For larger tables it's a little more difficult, as long-running processes can die for various reasons. For that reason I would continually checkpoint how far you have read, to avoid duplicate work when you restart your process.
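A small sketch of that checkpointing, again assuming a single string partition key named pk: persist the last key you finished processing somewhere durable (a local file here purely as a placeholder; S3 or a small DynamoDB item would work the same way), and feed it back in as the ExclusiveStartKey on restart.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.model.AttributeValue;

public class ScanCheckpoint {
    // Placeholder checkpoint location
    private static final Path CHECKPOINT_FILE = Path.of("backfill.checkpoint");

    // Persist the key of the last fully processed page; composite keys would need
    // both attributes written out.
    static void save(Map<String, AttributeValue> lastEvaluatedKey) throws IOException {
        Files.writeString(CHECKPOINT_FILE, lastEvaluatedKey.get("pk").s());
    }

    // Returns an ExclusiveStartKey to resume from, or null to start from the beginning.
    static Map<String, AttributeValue> load() throws IOException {
        if (!Files.exists(CHECKPOINT_FILE)) {
            return null;
        }
        String pk = Files.readString(CHECKPOINT_FILE);
        return Map.of("pk", AttributeValue.builder().s(pk).build());
    }
}
```

In the scan loops above you would call save(page.lastEvaluatedKey()) once a page has been fully updated, and pass load() as the initial ExclusiveStartKey when the process starts.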
Another important cost factor is the capacity mode: use provisioned capacity mode when doing any sort of backfilling or data loading. On-demand mode is pay-per-request, so it does not matter how fast or slow you go; you pay for each item read and written either way. Provisioned capacity mode offers significant cost savings over on-demand for this kind of work.