36

It has been suggested in the Amazon docs (http://aws.amazon.com/dynamodb/), among other places, that you can back up your DynamoDB tables using Elastic MapReduce.
I have a general understanding of how this could work, but I couldn't find any guides or tutorials on it.

So my question is: how can I automate DynamoDB backups (using EMR)?

So far, I think I need to create a "streaming" job with a map function that reads the data from DynamoDB and a reduce function that writes it to S3, and I believe these could be written in Python (or Java, or a few other languages).

Any comments, clarifications, code samples or corrections are appreciated.

Ali

10 Answers

38

With the introduction of AWS Data Pipeline, which includes a ready-made template for DynamoDB-to-S3 backups, the easiest way is to schedule a backup in Data Pipeline [link].

In case you have special needs (data transformation, very fine-grained control, ...), consider the answer by @greg.
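
If you'd rather drive this from code than click through the console, here is a rough boto3 sketch. It assumes you have already dumped a working definition once (e.g. with get_pipeline_definition from a pipeline you configured by hand using the "Export DynamoDB table to S3" template) and saved it as pipeline_definition.json; the pipeline name and uniqueId below are made up.

# Hedged sketch: create, define and activate a Data Pipeline backup with boto3.
# "pipeline_definition.json" is assumed to be a definition previously dumped with
# get_pipeline_definition from a hand-configured template pipeline.
import json
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

with open("pipeline_definition.json") as f:
    definition = json.load(f)

pipeline = dp.create_pipeline(name="dynamodb-backup", uniqueId="dynamodb-backup-001")
pipeline_id = pipeline["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=definition["pipelineObjects"],
    parameterObjects=definition.get("parameterObjects", []),
    parameterValues=definition.get("parameterValues", []),
)
dp.activate_pipeline(pipelineId=pipeline_id)

The backup schedule itself lives inside the pipeline definition, so you only need to rerun this when the definition changes.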

Jeroen van Dijk
  • Beware that by default, an m3.xlarge instance is used for the Data Pipeline. I ran the backup every 6 hours and ended up with costs > $1 per day - almost half of my total AWS charges. For businesses it's totally worth it, but for pet projects where you don't want to spend much money, the cost should be taken into account. – Ben Romberg Dec 03 '16 at 09:56
  • There are now automated backups built directly into DynamoDB. However, these backups disappear when the table itself is deleted, so I don't know how perfect they are. The pipeline just saved my butt in the above scenario - the table was accidentally deleted but all the data was in S3, not in the DynamoDB service. – fIwJlxSzApHEZIl May 31 '18 at 04:45
  • @anon58192932 Are you sure that the backup's lifecycle depends on that of the table? In the official description for restoring a backup, step 5 mentions the possibility of creating a new table with a new name. So basically you can spin up several copies of a table, if I understand it right. Which backups are deleted when one of the table copies is deleted? (link: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Restore.Tutorial.html ) – Elmar Macek Jan 09 '19 at 09:34
  • I haven't worked with DynamoDB in a while so I can't be 100% sure. It's possible that the UI elements just disappeared after deleting a table but the data was still there. Since AWS is always in a state of flux, it's very possible things have changed since I was working with DynamoDB day-in and day-out. – fIwJlxSzApHEZIl Jan 11 '19 at 15:40
14

There are some good guides for working with MapReduce and DynamoDB. I followed this one the other day and got data exporting to S3 reasonably painlessly. I think your best bet would be to create a Hive script that performs the backup task, save it in an S3 bucket, then use the AWS API for your language to programmatically spin up a new EMR job flow and complete the backup. You could set this up as a cron job.

Example of a Hive script exporting data from DynamoDB to S3:

CREATE EXTERNAL TABLE my_table_dynamodb (
    company_id string
    ,id string
    ,name string
    ,city string
    ,state string
    ,postal_code string)
 STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
 TBLPROPERTIES ("dynamodb.table.name"="my_table","dynamodb.column.mapping" = "company_id:company_id,id:id,name:name,city:city,state:state,postal_code:postal_code");

CREATE EXTERNAL TABLE my_table_s3 (
    company_id string
    ,id string
    ,name string
    ,city string
    ,state string
    ,postal_code string)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 LOCATION 's3://yourBucket/backup_path/dynamo/my_table';

 INSERT OVERWRITE TABLE my_table_s3
 SELECT * from my_table_dynamodb;

Here is an example of a PHP script that will spin up a new EMR job flow:

$emr = new AmazonEMR();

$response = $emr->run_job_flow(
            'My Test Job',
            array(
                "TerminationProtected" => "false",
                "HadoopVersion" => "0.20.205",
                "Ec2KeyName" => "my-key",
                "KeepJobFlowAliveWhenNoSteps" => "false",
                "InstanceGroups" => array(
                    array(
                        "Name" => "Master Instance Group",
                        "Market" => "ON_DEMAND",
                        "InstanceType" => "m1.small",
                        "InstanceCount" => 1,
                        "InstanceRole" => "MASTER",
                    ),
                    array(
                        "Name" => "Core Instance Group",
                        "Market" => "ON_DEMAND",
                        "InstanceType" => "m1.small",
                        "InstanceCount" => 1,
                        "InstanceRole" => "CORE",
                    ),
                ),
            ),
            array(
                "Name" => "My Test Job",
                "AmiVersion" => "latest",
                "Steps" => array(
                    array(
                        "HadoopJarStep" => array(
                            "Args" => array(
                                "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
                                "--base-path",
                                "s3://us-east-1.elasticmapreduce/libs/hive/",
                                "--install-hive",
                                "--hive-versions",
                                "0.7.1.3",
                            ),
                            "Jar" => "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                        ),
                        "Name" => "Setup Hive",
                        "ActionOnFailure" => "TERMINATE_JOB_FLOW",
                    ),
                    array(
                        "HadoopJarStep" => array(
                            "Args" => array(
                                "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
                                "--base-path",
                                "s3://us-east-1.elasticmapreduce/libs/hive/",
                                "--hive-versions",
                                "0.7.1.3",
                                "--run-hive-script",
                                "--args",
                                "-f",
                                "s3n://myBucket/hive_scripts/hive_script.hql",
                                "-d",
                                "INPUT=Var_Value1",
                                "-d",
                                "LIB=Var_Value2",
                                "-d",
                                "OUTPUT=Var_Value3",
                            ),
                            "Jar" => "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
                        ),
                        "Name" => "Run Hive Script",
                        "ActionOnFailure" => "CANCEL_AND_WAIT",
                    ),
                ),
                "LogUri" => "s3n://myBucket/logs",
            )
        );

greg
  • You use "$this", but there's no that... "$emr = $this->get('aws_EMR');" – kingjeffrey Jan 18 '13 at 00:29
  • @greg is there a way to apply some function to all rows in a hive script? I.e. I know all my rows have some `field` in the DynamoDB table and I want the output backup to have `f(field)` where `f` is some function? – sedavidw Mar 02 '16 at 21:44
  • @sedavidw you would need to change the last INSERT query to INSERT OVERWRITE my_table_s3 SELECT col1, col2, f(col3) ... FROM my_table_dynamodb – Swaranga Sarma Jul 19 '17 at 11:08
13

AWS Data Pipeline is costly, and the complexity of managing a templated process cannot compare to the simplicity of a CLI command you can modify and run on a schedule (using cron, TeamCity or your CI tool of choice).

Amazon promotes Data Pipeline as they make a profit on it. I'd say that it only really makes sense if you have a very large database (>3GB), as the performance improvement will justify it.

For small and medium databases (1GB or less) I'd recommend you use one of the many tools available; all three below can handle backup and restore processes from the command line:

Bear in mind that due to bandwidth/latency issues these will always perform better from an EC2 instance than from your local network.
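
To give an idea of what these tools do under the hood, here is a minimal boto3 sketch (the table and bucket names are placeholders) that scans a table to JSON and pushes it to S3; the dedicated tools add throttling, parallel scans and restore on top of essentially this:

# Minimal sketch of a CLI-style backup: scan a DynamoDB table to JSON and push it to S3.
# Table and bucket names are placeholders.
import json
import boto3

TABLE_NAME = "my_table"          # hypothetical table
BUCKET = "my-backup-bucket"      # hypothetical bucket

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

items = []
paginator = dynamodb.get_paginator("scan")
for page in paginator.paginate(TableName=TABLE_NAME):
    items.extend(page["Items"])  # items arrive in raw DynamoDB attribute-value format

s3.put_object(Bucket=BUCKET, Key=f"backups/{TABLE_NAME}.json", Body=json.dumps(items))
print(f"Backed up {len(items)} items from {TABLE_NAME}")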

Steven de Salas
12

With the introduction of DynamoDB Streams and Lambda, you should be able to take backups and incremental backups of your DynamoDB data.

You can associate your DynamoDB stream with a Lambda function to automatically trigger code for every data update (i.e. replicate the data to another store like S3).

Here is a Lambda function you can use with DynamoDB for incremental backups:

https://github.com/PageUpPeopleOrg/dynamodb-replicator

I've provided a detailed explanation of how you can use DynamoDB Streams, Lambda and S3 versioned buckets to create incremental backups for your data in DynamoDB on my blog:

https://www.abhayachauhan.com/category/aws/dynamodb/dynamodb-backups
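
To give a feel for the shape of such a function (this is not the replicator linked above, just a hedged sketch with a made-up bucket name), a stream-triggered Lambda can mirror every change into a versioned S3 bucket:

# Sketch of a DynamoDB Streams -> S3 incremental backup Lambda.
# Assumes the stream is configured with NEW_AND_OLD_IMAGES and that the bucket
# (a placeholder name) has versioning enabled so older copies are retained.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-dynamodb-incremental-backups"  # hypothetical bucket

def handler(event, context):
    for record in event["Records"]:
        table = record["eventSourceARN"].split("/")[1]
        # Build a deterministic object key from the item's primary key attributes.
        keys = record["dynamodb"]["Keys"]
        key_id = "-".join(next(iter(v.values())) for v in keys.values())
        s3_key = f"{table}/{key_id}.json"

        if record["eventName"] == "REMOVE":
            s3.delete_object(Bucket=BUCKET, Key=s3_key)  # versioning keeps the old object
        else:
            s3.put_object(Bucket=BUCKET, Key=s3_key,
                          Body=json.dumps(record["dynamodb"]["NewImage"]))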

Edit:

As of Dec 2017, DynamoDB has released On-Demand Backup/Restore. This allows you to take backups and store them natively in DynamoDB, and they can be restored to a new table. A detailed walkthrough is provided here, including code to schedule the backups:

https://www.abhayachauhan.com/2017/12/dynamodb-scheduling-on-demand-backups
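
If you just want to trigger those on-demand backups on a schedule yourself (from cron or a scheduled Lambda), the API call is a one-liner; a small boto3 sketch follows, with a made-up table name:

# Sketch: trigger a DynamoDB on-demand backup via the API (table name is a placeholder).
from datetime import datetime, timezone
import boto3

dynamodb = boto3.client("dynamodb")

table_name = "my_table"
backup_name = f"{table_name}-{datetime.now(timezone.utc).strftime('%Y%m%d-%H%M%S')}"

response = dynamodb.create_backup(TableName=table_name, BackupName=backup_name)
print(response["BackupDetails"]["BackupArn"])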

HTH

Abhaya Chauhan
  • While that link has a lot of useful information and details, I definitely wouldn't consider it a 'detailed walkthrough'. A walkthrough would actually involve steps. How do we set up DDB/Streams/Lambda/S3? How do we configure the replicator and make Lambda use it? What IAM permissions are involved in the 4 pieces working together? etc. etc. – keen Jun 20 '16 at 18:52
5

You can use my simple node.js script dynamo-archive.js, which scans an entire DynamoDB table and saves the output to a JSON file. Then you can upload it to S3 using s3cmd.

yegor256
4

You can use the handy dynamodump tool, which is Python-based (uses boto), to dump the tables into JSON files, and then upload them to S3 with s3cmd.

250R
  • Why not use the Data Pipeline tool, provided by Amazon as part of AWS? – Ali Apr 15 '14 at 13:01
  • @Ali The Data Pipeline tool is extremely slow and costly due to the EMR clusters. It often leaves these clusters running past the termination period. A simple backup script can do the same job in 1/100th the time for negligible cost. – Jake Oct 03 '15 at 02:48
1

I found the dynamodb-backup Lambda function to be really helpful. It took me 5 minutes to set up, and it can easily be configured to use a CloudWatch Schedule event (don't forget to run npm install first, though).

It's also a lot cheaper for me, coming from Data Pipeline (~$40 per month); I estimate the costs at around 1.5 cents per month (both without S3 storage). Note that it backs up all DynamoDB tables at once by default, which can easily be adjusted within the code.

The only missing part is being notified if the function fails, which Data Pipeline was able to do.
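
For reference, wiring a CloudWatch schedule to an existing backup Lambda via the API looks roughly like the sketch below; the rule name, schedule expression and function ARN are placeholders, not values taken from that project:

# Sketch: attach a CloudWatch Events (EventBridge) schedule to an existing backup Lambda.
# Rule name, schedule and function ARN are placeholders.
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:dynamodb-backup"  # hypothetical
RULE_NAME = "dynamodb-backup-daily"

rule = events.put_rule(Name=RULE_NAME, ScheduleExpression="rate(1 day)")

# Allow the rule to invoke the function, then register the function as the rule's target.
lambda_client.add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="allow-cloudwatch-schedule",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)
events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "backup-lambda", "Arn": FUNCTION_ARN}])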

Ben Romberg
1

AWS Data Pipeline is only available in a limited set of regions.

It took me 2 hours to debug the template.

https://docs.aws.amazon.com/general/latest/gr/rande.html#datapipeline_region

xichen
0

You can now back up your DynamoDB data straight to S3 natively, without using Data Pipeline or writing custom scripts. This is probably the easiest way to achieve what you wanted, because it does not require you to write any code or run any task/script, since it's fully managed.

Rafal Wiliński
-1

Since 2020, you can export a DynamoDB table to S3 directly in the AWS UI:

https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/

You need to activate PITR (Point in Time Recovery) first. You can choose between JSON and Amazon ION format.

In the Java SDK (Version 2), you can do something like this:

// first activate PITR
PointInTimeRecoverySpecification pointInTimeRecoverySpecification  = PointInTimeRecoverySpecification
    .builder()
    .pointInTimeRecoveryEnabled(true)
    .build();
UpdateContinuousBackupsRequest updateContinuousBackupsRequest = UpdateContinuousBackupsRequest
    .builder()
    .tableName(myTable.getName())
    .pointInTimeRecoverySpecification(pointInTimeRecoverySpecification)
    .build();

try {
    UpdateContinuousBackupsResponse updateContinuousBackupsResponse =
        dynamoDbClient.updateContinuousBackups(updateContinuousBackupsRequest);
    String updatedPointInTimeRecoveryStatus = updateContinuousBackupsResponse
        .continuousBackupsDescription()
        .pointInTimeRecoveryDescription()
        .pointInTimeRecoveryStatus()
        .toString();
    log.info("Point in Time Recovery for Table {} activated: {}", myTable.getName(),
        updatedPointInTimeRecoveryStatus);
} catch (Exception e) {
    log.error("Point in Time Recovery Activation failed: {}", e.getMessage());
}

// ... now get the table ARN
DescribeTableRequest describeTableRequest = DescribeTableRequest
    .builder()
    .tableName(myTable.getName())
    .build();

DescribeTableResponse describeTableResponse = dynamoDbClient.describeTable(describeTableRequest);
String tableArn = describeTableResponse.table().tableArn();
String s3Bucket = "myBucketName";

// choose the format (JSON or ION)
ExportFormat exportFormat = ExportFormat.ION;
ExportTableToPointInTimeRequest exportTableToPointInTimeRequest = ExportTableToPointInTimeRequest
    .builder()
    .tableArn(tableArn)
    .s3Bucket(s3Bucket)
    .s3Prefix(myTable.getS3Prefix())
    .exportFormat(exportFormat)
    .build();
dynamoDbClient.exportTableToPointInTime(exportTableToPointInTimeRequest);

Your dynamoDbClient needs to be an instance of software.amazon.awssdk.services.dynamodb.DynamoDbClient; the DynamoDbEnhancedClient or DynamoDbEnhancedAsyncClient will not work.

Suzana