1

As the question says, I have about 200 million records in a table in DynamoDB. I am writing a script in Node.js that needs to delete all records without a TTL. I have 3 ideas for this, and I am curious what people think about how I should go about doing this on so many records.

  1. batchWrite: with this option I would scan and then paginate through the whole table, deleting each record that meets the condition of not currently having a TTL

  2. push all records that don't have a TTL to a new table and then delete that table all at once

  3. set a TTL on records that don't have one, but I can't find any information on whether this is even a thing, or whether I can somehow bulk-add a TTL to all records without one

Any information is helpful; please let me know how I can go about doing this! Thank you
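For what it's worth, idea 3 is possible: there is no built-in bulk operation, but you can scan for items missing the attribute and set one with UpdateItem. A minimal Node.js sketch, assuming AWS SDK v3, a table named `my-table` with a string partition key `id`, and a TTL attribute named `ttl` (all of these are placeholders — adjust to your schema):

```javascript
// Pure helper: epoch-seconds timestamp `days` from now (the format DynamoDB's
// TTL feature expects).
function ttlEpochSeconds(days, nowMs = Date.now()) {
  return Math.floor(nowMs / 1000) + days * 24 * 60 * 60;
}

async function backfillTtl(tableName, ttlDays) {
  // SDK loaded lazily so the pure helper above works without it installed.
  const { DynamoDBClient, ScanCommand, UpdateItemCommand } =
    await import('@aws-sdk/client-dynamodb');
  const client = new DynamoDBClient({});
  const expiry = String(ttlEpochSeconds(ttlDays));

  let lastKey;
  do {
    // Scan one page at a time; the filter keeps only items missing a TTL.
    const page = await client.send(new ScanCommand({
      TableName: tableName,
      FilterExpression: 'attribute_not_exists(#t)',
      ExpressionAttributeNames: { '#t': 'ttl' },
      ExclusiveStartKey: lastKey,
    }));
    for (const item of page.Items ?? []) {
      await client.send(new UpdateItemCommand({
        TableName: tableName,
        Key: { id: item.id },
        UpdateExpression: 'SET #t = :expiry',
        // Don't clobber a TTL written concurrently by another process.
        ConditionExpression: 'attribute_not_exists(#t)',
        ExpressionAttributeNames: { '#t': 'ttl' },
        ExpressionAttributeValues: { ':expiry': { N: expiry } },
      }));
    }
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
}
```

Note that the filter does not reduce scan cost (filtering happens after the read), but the TTL deletions themselves are free, which is why this option can end up the cheapest overall. TTL typically removes expired items within a few days, not instantly.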

kayvera
  • 41
  • 9
  • 1
    I don't understand how your 2nd idea would work. If you copied all records from one table to a second, and then deleted the second table, they are still going to exist in the first table. For the 3rd option you could look at this https://aws.amazon.com/blogs/database/backfilling-an-amazon-dynamodb-time-to-live-ttl-attribute-with-amazon-emr/ – Mark B Jan 15 '21 at 21:16
  • sorry I meant copy over the correct records to the new table and delete the old, my mistake. Do you think the third idea would cost the least? – kayvera Jan 15 '21 at 21:19
  • I have no idea on the costs. If you have to spin up some EMR servers for the 3rd option that would probably cost the most, but it may be the only way to get it done in a reasonable amount of time. – Mark B Jan 15 '21 at 21:21
  • @kayvera needs to delete all data without a TTL - This means that you want to delete only those records which currently don't have TTL field? Your table has records which have TTL? – Anshul Saraswat Kaul Jan 15 '21 at 21:41
  • yes, I have a table where some data has a TTL and some doesn't. I am just not sure which way is the most cost effective way to handle this. The link @MarkB added would actually probably work but so would the other two ideas, I just don't know which one is the best one to do – kayvera Jan 15 '21 at 21:45

2 Answers

2

I would go with option 1.

Check the Parallel Scan documentation; pasting some information here:

Segment — A segment to be scanned by a particular worker. Each worker should use a different value for Segment.
TotalSegments — The total number of segments for the parallel scan. This value must be the same as the number of workers that your application will use.

Here, each segment lets a separate worker scan a different portion of the table in parallel (for reference, each partition in DynamoDB holds about 10 GB). With the scan/read made faster, we can then perform the deletes using BatchWriteItem.
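A minimal Node.js sketch of this approach, assuming AWS SDK v3, a table named `my-table` with a string partition key `id`, and a TTL attribute named `ttl` (all placeholders — adjust to your schema):

```javascript
// Pure helper: split items into chunks of 25, BatchWriteItem's per-request limit.
function chunk(items, size = 25) {
  const out = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// One worker: scans its segment and batch-deletes items that have no TTL.
async function deleteSegment(tableName, segment, totalSegments) {
  // SDK loaded lazily so the pure helper above works without it installed.
  const { DynamoDBClient, ScanCommand, BatchWriteItemCommand } =
    await import('@aws-sdk/client-dynamodb');
  const client = new DynamoDBClient({});

  let lastKey;
  do {
    const page = await client.send(new ScanCommand({
      TableName: tableName,
      Segment: segment,
      TotalSegments: totalSegments,
      FilterExpression: 'attribute_not_exists(#t)',
      ExpressionAttributeNames: { '#t': 'ttl' },
      ProjectionExpression: 'id', // only the key is needed to delete
      ExclusiveStartKey: lastKey,
    }));
    for (const batch of chunk(page.Items ?? [])) {
      let requests = batch.map((item) => ({
        DeleteRequest: { Key: { id: item.id } },
      }));
      while (requests.length > 0) {
        const res = await client.send(new BatchWriteItemCommand({
          RequestItems: { [tableName]: requests },
        }));
        // Retry throttled items; production code should add backoff here.
        requests = res.UnprocessedItems?.[tableName] ?? [];
      }
    }
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);
}

// Run e.g. 8 workers, one per segment:
// await Promise.all([...Array(8)].map((_, i) => deleteSegment('my-table', i, 8)));
```

The number of segments is up to you; it just has to match the number of workers, and each worker must pass the same TotalSegments value.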

  • @kayvera See this [link](https://stackoverflow.com/questions/9154264/what-is-the-recommended-way-to-delete-a-large-number-of-items-from-dynamodb?rq=1) to get more details. Let us know if this helped! – Anshul Saraswat Kaul Jan 16 '21 at 08:21
0

I would do it like this (option 1):

import boto3

# Create a DynamoDB client
dynamodb = boto3.client('dynamodb')

# Name of the table to remove entries from
table_name = 'my-table'

# Scan the table page by page (a single scan call returns at most 1 MB of data)
scan_kwargs = {'TableName': table_name}
while True:
    response = dynamodb.scan(**scan_kwargs)

    # Iterate over the items in this page
    for item in response['Items']:
        # Check if the item has a TTL attribute
        if 'ttl' not in item:
            # Delete the item if it does not have a TTL attribute
            dynamodb.delete_item(
                TableName=table_name,
                Key={
                    'id': item['id']
                }
            )

    # Stop when there are no more pages
    if 'LastEvaluatedKey' not in response:
        break
    scan_kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']
JLT
  • 712
  • 9
  • 15