
I saw that similar questions already exist:

Backup AWS Dynamodb to S3

Copying only new records from AWS DynamoDB to AWS Redshift

Loading data from Amazon dynamoDB to redshift

Unfortunately, most of them are outdated (Amazon has introduced new services since then) and/or answer a different question.

In my case I have two databases (Amazon Redshift and DynamoDB) and I have to:

  • Keep the Redshift database up to date
  • Store a database backup on S3

To do that I want to use the following approach:

  1. Back up only new/modified records from DynamoDB to S3 at the end of the day (one file per day).
  2. Update the Redshift database using the file from S3 (a rough sketch of this step follows below).
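
A minimal sketch of step 2, assuming the day's records have already been exported to a hypothetical file s3://my-backup-bucket/dynamodb/2016-03-22.json (one JSON object per line), a Redshift table named events, and an IAM role the cluster can use to read from S3 (all of these names are placeholders, not part of the question):

```python
# Sketch: bulk-load one day's export file from S3 into Redshift with COPY.
# Endpoint, credentials, bucket, table and role ARN are all placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="admin",
    password="REPLACE_ME",
)

copy_sql = """
    COPY events
    FROM 's3://my-backup-bucket/dynamodb/2016-03-22.json'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS JSON 'auto';
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # a single COPY loads the whole daily file in one pass
conn.close()
```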

So my question is: what is the most efficient way to do that?

I read this tutorial, but I am not sure whether AWS Data Pipeline can be configured to "catch" only new records from DynamoDB. If that is not possible, scanning the entire table every time is not an option.

Thank you in advance!

bpavlov

1 Answer


You can use AWS Lambda with DynamoDB Streams (documentation).

You can configure your Lambda function to receive the updated records (from the DynamoDB stream) and then update the Redshift database.
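
A minimal sketch of such a handler, assuming the function is subscribed to the table's stream (with a view type that includes NewImage) and that new/modified items are written under a per-day prefix in a hypothetical S3 bucket, so the daily COPY into Redshift described in the question can pick them up later. The bucket name and key layout are assumptions, not part of this answer:

```python
# Sketch: Lambda handler for a DynamoDB Streams event source.
# Each invocation writes its batch of new/modified items to a dated S3 prefix;
# all objects for one day can then be COPYed into Redshift in bulk.
# Note: NewImage uses DynamoDB's attribute-value JSON ({"S": ...}, {"N": ...})
# and may need flattening before Redshift's COPY ... FORMAT AS JSON can map it.
import json
import uuid
from datetime import datetime

import boto3

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"  # placeholder bucket name


def handler(event, context):
    items = []
    for record in event.get("Records", []):
        # INSERT and MODIFY events carry the full new item in NewImage.
        if record["eventName"] in ("INSERT", "MODIFY"):
            items.append(record["dynamodb"]["NewImage"])

    if not items:
        return {"written": 0}

    key = "dynamodb/{}/{}.json".format(
        datetime.utcnow().strftime("%Y-%m-%d"), uuid.uuid4()
    )
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body="\n".join(json.dumps(item) for item in items).encode("utf-8"),
    )
    return {"written": len(items), "key": key}
```

Whether to update Redshift directly from the handler or to go through S3 and a scheduled COPY (as discussed in the comments below) is a design choice; the per-day prefix here simply keeps both options open.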

Eyal Ch
  • In that example they are using a Lambda function attached to a DynamoDB Stream. The Lambda is invoked on every DynamoDB change (insert/modify), so I would have to read the whole "log" file from S3 and then update it on every change. Maybe I need to schedule the Lambda to be executed only once, at the end of the day. – bpavlov Mar 22 '16 at 13:59
  • I don't understand the use of the S3 log file. With Lambda you don't need it. – Eyal Ch Mar 22 '16 at 14:07
  • I am using the S3 file as a database backup. Another reason is: "COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well." Also, if I need to restore some data I do not need to read it again from DynamoDB. – bpavlov Mar 22 '16 at 14:13
  • Yes, so in this case you can log all DynamoDB updates to a file, and then load it from S3 into Redshift. – Eyal Ch Mar 22 '16 at 14:26
  • Yes, that is what I want. I was thinking I could achieve this with Data Pipeline, but if there are no other answers I should write a Lambda (or several) that does this. – bpavlov Mar 22 '16 at 14:38
  • @bpavlov I still don't understand why you need to have S3 involved in all of this. What's the problem with triggering an update on Redshift for every update on DynamoDB? – Matias Cicero Mar 22 '16 at 15:05
  • If I am not missing something, according to the documentation "COPY is more efficient than an INSERT statement and stores the data more effectively as well". But the main reason is that I would have a database backup on S3. That data could then be moved to Amazon Glacier, etc. – bpavlov Mar 22 '16 at 15:18
  • @EyalCh I am facing the problem described here: http://stackoverflow.com/questions/31912168/reading-data-from-dynamodb-streams while reading the stream. NextShardIterator is always NOT null, so I end up with an infinite loop. What is the best approach in this situation? What I was doing was to iterate only over closed shards and keep the last ShardId logged, but this is not a good solution at all. – bpavlov Jun 13 '16 at 12:18
  • Maybe it is a good idea to open a new topic: http://stackoverflow.com/questions/37814516/reading-aws-dynamodb-stream – bpavlov Jun 14 '16 at 14:14
  • Just to answer "What's the problem with triggering an update on Redshift for every update on DynamoDB?": it would not scale, because a single insert is a leader-node operation, meaning it becomes a bottleneck as you handle more data. Writes can take many seconds depending on the volume of data, and at some point they will start to time out altogether as the leader node becomes swamped with INSERTs. Only the COPY command can scale over the long term, because it gets all the nodes in the cluster to insert data. – phill.tomlinson Nov 17 '16 at 08:02