
I'm looking for some advice or best practices for backing up an S3 bucket.
The purpose of backing up data from S3 is to prevent data loss caused by the following:

  1. S3 issue
  2. issue where I accidentally delete this data from S3

After some investigation I see the following options:

  1. Use versioning http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html
  2. Copy from one S3 bucket to another using AWS SDK
  3. Backup to Amazon Glacier http://aws.amazon.com/en/glacier/
  4. Backup to production server, which is itself backed up

Which option should I choose, and how safe would it be to store data only on S3? I want to hear your opinions.

Niklas Ekman
Sergey Alekseev

8 Answers


Originally posted on my blog: http://eladnava.com/backing-up-your-amazon-s3-buckets-to-ec2/

Sync Your S3 Bucket to an EC2 Server Periodically

This can be easily achieved by utilizing multiple command line utilities that make it possible to sync a remote S3 bucket to the local filesystem.

s3cmd
At first, s3cmd looked extremely promising. However, after trying it on my enormous S3 bucket -- it failed to scale, erroring out with a Segmentation fault. It did work fine on small buckets, though. Since it did not work for huge buckets, I set out to find an alternative.

s4cmd
The newer, multi-threaded alternative to s3cmd. Looked even more promising, however, I noticed that it kept re-downloading files that were already present on the local filesystem. That is not the kind of behavior I was expecting from the sync command. It should check whether the remote file already exists locally (hash/filesize checking would be neat) and skip it in the next sync run on the same target directory. I opened an issue (bloomreach/s4cmd/#46) to report this strange behavior. In the meantime, I set out to find another alternative.

awscli
And then I found awscli. This is Amazon's official command line interface for interacting with their different cloud services, S3 included.


It provides a useful sync command that quickly and easily downloads the remote bucket files to your local filesystem.

$ aws s3 sync s3://your-bucket-name /home/ubuntu/s3/your-bucket-name/

Benefits:

  • Scalable - supports huge S3 buckets
  • Multi-threaded - syncs the files faster by utilizing multiple threads
  • Smart - only syncs new or updated files
  • Fast - thanks to its multi-threaded nature and smart sync algorithm

Accidental Deletion

Conveniently, the sync command won't delete files in the destination folder (local filesystem) if they are missing from the source (S3 bucket), and vice-versa. This is perfect for backing up S3 -- in case files get deleted from the bucket, re-syncing it will not delete them locally. And in case you delete a local file, it won't be deleted from the source bucket either.
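
For reference, the sync command only deletes destination files if you explicitly pass the --delete flag, so for a backup you simply leave that flag off. A minimal illustration (bucket name is a placeholder):

$ # Safe for backups: files deleted from the bucket are kept locally
$ aws s3 sync s3://your-bucket-name /home/ubuntu/s3/your-bucket-name/

$ # NOT what you want for a backup: mirrors deletions to the destination
$ aws s3 sync s3://your-bucket-name /home/ubuntu/s3/your-bucket-name/ --delete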

Setting up awscli on Ubuntu 14.04 LTS

Let's begin by installing awscli. There are several ways to do this, however, I found it easiest to install it via apt-get.

$ sudo apt-get install awscli

Configuration

Next, we need to configure awscli with our Access Key ID & Secret Key, which you must obtain from IAM by creating a user and attaching the AmazonS3ReadOnlyAccess policy. A read-only policy will also prevent you, or anyone who gains access to these credentials, from deleting your S3 files. Make sure to enter your S3 region, such as us-east-1.
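
If you prefer the command line over the IAM console, a rough sketch of creating such a read-only user might look like this (the user name is a placeholder, the commands must be run with credentials that are allowed to manage IAM, and the policy ARN is the AWS-managed AmazonS3ReadOnlyAccess policy):

$ aws iam create-user --user-name s3-backup-reader
$ aws iam attach-user-policy --user-name s3-backup-reader \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
$ aws iam create-access-key --user-name s3-backup-reader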

$ aws configure

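The aws configure prompts typically look like this (the values below are placeholders, taken from AWS's documented example credentials):

AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json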

Preparation

Let's prepare the local S3 backup directory, preferably in /home/ubuntu/s3/{BUCKET_NAME}. Make sure to replace {BUCKET_NAME} with your actual bucket name.

$ mkdir -p /home/ubuntu/s3/{BUCKET_NAME}

Initial Sync

Let's go ahead and sync the bucket for the first time with the following command:

$ aws s3 sync s3://{BUCKET_NAME} /home/ubuntu/s3/{BUCKET_NAME}/

Assuming the bucket exists, the AWS credentials and region are correct, and the destination folder is valid, awscli will start to download the entire bucket to the local filesystem.

Depending on the size of the bucket and your Internet connection, it could take anywhere from a few seconds to hours. When that's done, we'll go ahead and set up an automatic cron job to keep the local copy of the bucket up to date.

Setting up a Cron Job

Go ahead and create a sync.sh file in /home/ubuntu/s3:

$ nano /home/ubuntu/s3/sync.sh

Copy and paste the following code into sync.sh:

#!/bin/sh

# Echo the current date and time

echo '-----------------------------'
date
echo '-----------------------------'
echo ''

# Echo script initialization
echo 'Syncing remote S3 bucket...'

# Actually run the sync command (replace {BUCKET_NAME} with your S3 bucket name)
/usr/bin/aws s3 sync s3://{BUCKET_NAME} /home/ubuntu/s3/{BUCKET_NAME}/

# Echo script completion
echo 'Sync complete'

Make sure to replace {BUCKET_NAME} with your S3 bucket name, twice throughout the script.

Pro tip: You should use the full path /usr/bin/aws to the aws binary, as crontab executes commands in a limited shell environment and won't be able to find the executable on its own.

Next, make sure to chmod the script so it can be executed by crontab.

$ sudo chmod +x /home/ubuntu/s3/sync.sh

Let's try running the script to make sure it actually works:

$ /home/ubuntu/s3/sync.sh

The output should be similar to this (shown as a screenshot in the original post): the date banner printed by the script, 'Syncing remote S3 bucket...', a download line for each file that aws s3 sync pulls down, and finally 'Sync complete'.

Next, let's edit the current user's crontab by executing the following command:

$ crontab -e

If this is your first time executing crontab -e, you'll need to select a preferred editor. I'd recommend selecting nano as it's the easiest for beginners to work with.

Sync Frequency

We need to tell crontab how often to run our script and where the script resides on the local filesystem by writing a command. The format for this command is as follows:

m h  dom mon dow   command

The following command configures crontab to run the sync.sh script every hour (specified via the minute:0 and hour:* parameters) and to have it pipe the script's output to a sync.log file in our s3 directory:

0 * * * * /home/ubuntu/s3/sync.sh > /home/ubuntu/s3/sync.log

You should add this line to the bottom of the crontab file you are editing. Then, go ahead and save the file to disk by pressing Ctrl + O and then Enter. You can then exit nano by pressing Ctrl + X. crontab will now run the sync task every hour.

Pro tip: You can verify that the hourly cron job is being executed successfully by inspecting /home/ubuntu/s3/sync.log, checking its contents for the execution date & time, and inspecting the logs to see which new files have been synced.
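
For example, a quick way to eyeball the most recent run (path as configured above):

$ tail -n 20 /home/ubuntu/s3/sync.log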

All set! Your S3 bucket will now get synced to your EC2 server every hour automatically, and you should be good to go. Do note that over time, as your S3 bucket gets bigger, you may have to increase your EC2 server's EBS volume size to accommodate new files. You can always increase your EBS volume size by following this guide.

Elad Nava
  • I've left a question on your blog, but I wondered whether there's a way of syncing the metadata too? – RobbiewOnline May 17 '18 at 10:24
  • @Devology Ltd, Unfortunately I haven't had a chance to work with S3 object metadata. From a quick Google search, it doesn't seem like the `awscli` supports syncing this automatically in the `aws s3 sync` command. It looks like you may have to implement this manually. – Elad Nava May 17 '18 at 23:00
  • Thanks @Ekad Nava - I appreciate you confirming what I believed was the case. – RobbiewOnline May 18 '18 at 09:59
  • This is fantastic @EladNava thanks for sharing, still relevant in 2020! – user1130176 Jan 16 '20 at 21:15
  • This answer doesn't fit when you have millions of files in the bucket. It becomes very expensive, slow and sometimes impossible, because of limits on the filesystem. – Psychozoic May 14 '20 at 07:07
  • @Psychozoic if you choose a filesystem with a configurable inode limit (e.g. `ext4`), and you can provision a large enough EBS volume size, it is feasible to back up any number / size of files, yet can indeed become expensive. You don't necessarily have to sync to an EC2 server, instead you can sync to a local computer. But you will always pay AWS bandwidth costs. – Elad Nava May 14 '20 at 23:13
  • The `aws s3 sync` command only syncs the first 1000 objects. How can I sync the complete bucket? – RamNow Mar 24 '22 at 08:39

Taking into account the related link, which explains that S3 has 99.999999999% durability, I would discard your concern #1. Seriously.

Now, if #2 is a valid use case and a real concern for you, I would definitely stick with options #1 or #3. Which one of them? It really depends on some questions:

  • Do you need any other of the versioning features or is it only to avoid accidental overwrites/deletes?
  • Is the extra cost imposed by versioning affordable?
  • Amazon Glacier is optimized for data that is infrequently accessed and for which retrieval times of several hours are suitable. Is this OK for you?

Unless your storage use is really huge, I would stick with bucket versioning. This way, you won't need any extra code/workflow to back up data to Glacier, to other buckets, or even to any other server (which is really a bad choice IMHO, please forget about it).
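
For reference, enabling versioning requires no extra code at all; it's a single setting on the bucket. A minimal sketch with the AWS CLI (bucket name and prefix are placeholders):

$ aws s3api put-bucket-versioning --bucket my-bucket \
    --versioning-configuration Status=Enabled

$ # Previous versions (and delete markers) can then be listed with:
$ aws s3api list-object-versions --bucket my-bucket --prefix some/key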

Gray
Viccari
  • @SergeyAlekseev If Glacier is something that will work for you, it's very quick to set up a lifecycle rule on a bucket that automagically archives your files to Glacier. They'll still appear in the bucket (in the web UI) but the storage class will change from standard to glacier. I move processed files from my main bucket to a "done" bucket, and the done bucket has the lifecycle rule on it that archives anything greater than 1 day old. These are data files that I probably will never touch again, but need to keep for the client. – Dan Jul 25 '13 at 20:55
  • I don't think 99.999999999% is a good enough reason to go all-in on the AWS stack for storage/backup. I'm not talking about the remaining 0.0000000001%, but rather that if something highly unexpected occurs, it feels awkward to have the whole of your business lying in one place. By unexpected, it could be the US going to war with a specific country, Amazon being completely hacked (cf. Sony), etc. – Augustin Riedinger Mar 04 '15 at 08:03
  • I will back @AugustinRiedinger on this one: an "S3 issue" can by definition be something you don't know about yet (e.g. governmental issues), which could invalidate the hypotheses that S3 SLA numbers like 99.99... are based on. When doing anything long-term, including backing up your data, **diversification** is good practice, if not a prerequisite. – lajarre Mar 04 '15 at 08:50
  • I definitely agree that your points are valid. But based on the options given by the OP (pretty much all of them, including AWS alternatives to the problem), I don't think an "S3 issue" would be as broad as you guys are making it. Good to see some broader thoughts, though. – Viccari Mar 04 '15 at 16:07
  • Great and valid suggestions for backup options. However, as Viccari pointed out in his answer, concern #1 should not be valid because of the high durability. To avoid deleting files (concern #2), you should configure AWS Identity and Access Management properly so that users don't have permissions to delete anything important. If you're afraid of something accidentally getting deleted, the solution is not to duplicate the data, but rather to protect the data from accidental deletion, IMO. – user2124655 Feb 17 '16 at 14:40
  • Old answer, but I feel as if I need to mention recent(-ish) events. "The day Amazon broke the web", when a tech accidentally deleted a *huge* portion of their S3 servers. Even during those 24 hours, the problem was accessibility. Not data loss. There was absolutely no data loss, even given the large number of servers being removed, and they still managed to come well within their SLA. – Oberst Jul 05 '17 at 15:56
  • #1 is still a potential issue ("S3 issue"). I have seen interactions with S3 fail due to bugs in tools - for example, I've seen the aws cli tool fail to copy across all appropriate files with a 'sync' command. This fits the spirit of fault #1, which is really "an AWS-side issue beyond my control" rather than an issue specifically with S3 durability. – vacri Mar 19 '19 at 23:05
  • @Viccari [AWS say](https://aws.amazon.com/s3/faqs/#How_durable_is_Amazon_S3): "As with any environment, the best practice is to have a backup and to put in place safeguards against malicious or accidental deletion. For S3 data, that best practice includes secure access permissions, Cross-Region Replication, versioning, and a functioning, regularly tested backup." The most important part is: "... have a backup **and** to put in place ...". To me it means that you are supposed to have a backup **and** you could also put in place ... Bottom line: I would *not* discard concern #1. – rsl Sep 12 '19 at 14:29
  • @rsl Absolutely, but the recommendation you just mentioned is about "malicious or accidental deletion". That does not sound like an "S3 issue", like the original question states it, which is what my answer is aimed at. – Viccari Sep 14 '19 at 00:05
  • @Viccari I see your point. It depends if you believe that there is a 0% probability that S3 might accidentally delete some of your objects. That would definitely make it a "S3 issue" and you would be a very sad man since you were warned (see my previous comment). Anyway, I keep commenting on this for one thing - from your original answer I feel that you do not encourage to backup (aka versioning is enough). I am convinced about the opposite and AWS say the same thing. – rsl Sep 15 '19 at 20:30
  • Sorry if it came across that way. I think backup is needed for several reasons, but I would not think that an S3 issue is anywhere near the top of the list of reasons. Also, this answer is now 6 years old, maybe time proved me wrong? :) – Viccari Sep 16 '19 at 15:13

How about using the readily available Cross-Region Replication feature on the S3 bucket itself?
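
A rough sketch of what setting this up with the CLI involves (bucket names, account ID and IAM role are placeholders; both buckets must have versioning enabled, and the role needs the appropriate replication permissions):

$ aws s3api put-bucket-versioning --bucket source-bucket --versioning-configuration Status=Enabled
$ aws s3api put-bucket-versioning --bucket backup-bucket --versioning-configuration Status=Enabled

$ cat > replication.json <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [
    {
      "ID": "backup-everything",
      "Status": "Enabled",
      "Prefix": "",
      "Destination": { "Bucket": "arn:aws:s3:::backup-bucket" }
    }
  ]
}
EOF

$ aws s3api put-bucket-replication --bucket source-bucket \
    --replication-configuration file://replication.json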

Adrian Teh
  • What if you delete a file in one region - will the deletion be replicated to the other one? – michelem May 03 '20 at 06:30
  • S3 does not replicate deletions, check out this link https://docs.aws.amazon.com/AmazonS3/latest/dev/replication-what-is-isnot-replicated.html. – ᐅdevrimbaris May 21 '20 at 11:42

You can back up your S3 data using the following methods:

  1. Schedule the backup process using AWS Data Pipeline. It can be done in the two ways mentioned below:

    a. Using the CopyActivity of Data Pipeline, with which you can copy from one S3 bucket to another S3 bucket.

    b. Using the ShellActivity of Data Pipeline and the "S3distcp" command to do a recursive copy of S3 folders from one bucket to another (in parallel).

  2. Use versioning inside the S3 bucket to maintain different versions of the data.

  3. Use Glacier to back up your data (use it when you don't need to restore the backup quickly to the original buckets, since it takes some time to get data back from Glacier as it is stored in a compressed format, or when you want to save some cost by avoiding another S3 bucket for backup). This option can easily be set up using a lifecycle rule on the S3 bucket you want to back up.

Option 1 can give you more security in case, say, you accidentally delete your original S3 bucket, and another benefit is that you can store your backup in date-wise folders in another S3 bucket; this way you know what data you had on a particular date and can restore a backup from a specific date. It all depends on your use case.
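
As an illustration of the date-wise layout mentioned above (just the idea, not Data Pipeline itself; bucket names are placeholders), a scheduled job could copy the bucket into a prefix named after the current date:

$ aws s3 sync s3://my-production-bucket s3://my-backup-bucket/$(date +%F)/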

Gray
Varun
  • @David: As David suggested in his solution below, there could be a script that backs up the S3 bucket daily or weekly. This can be easily attained by my first point (AWS Data Pipeline, which gives you the ability to schedule the backup process daily, weekly, etc.). I would recommend taking a look at AWS Data Pipeline. – Varun Jan 12 '15 at 17:30
  • This shows some promise, because it doesn't rely on outmoded approaches that do not excel at making the most of the cloud (read: crons). Data Pipeline also has automated retries, and is a managed (serverless) service. – Felipe Alvarez Jun 01 '20 at 03:48
  • This kind of backup won't help if S3 becomes unavailable. It is uncommon, but it happens. – fjsj Jun 25 '21 at 12:18

You'd think there would be an easier way by now to just hold some sort of incremental backup in a different region.

All the suggestions above are not really simple or elegant solutions. I don't really consider Glacier an option, as I think that's more of an archival solution than a backup solution. When I think backup, I think disaster recovery from a junior developer recursively deleting a bucket, or perhaps an exploit or bug in your app that deletes stuff from S3.

To me, the best solution would be a script that just backs up one bucket to another region, one copy daily and one weekly, so that if something terrible happens you can just switch regions. I don't have a setup like this; I've looked into it, but haven't gotten around to doing it because it would take a bit of effort, which is why I wish there was some stock solution to use.
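
A minimal sketch of such a script, under those assumptions (bucket names and regions are placeholders; run it daily from cron, and a second copy pointing at a weekly prefix would cover the weekly backup):

#!/bin/sh
# Hypothetical cross-region backup: mirror the production bucket into a
# backup bucket that lives in a different region.
/usr/bin/aws s3 sync s3://my-production-bucket s3://my-backup-bucket-eu/daily/ \
    --source-region us-east-1 --region eu-west-1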

David
  • Agreed. It's interesting when you dig into S3 (even CRR - built in replication) there are big holes for disaster recovery. You can't, for example, ever restore a bucket, the file version histories, the metadata (esp last modified dates) etc. All recovery scenarios currently available are partial recoveries. – Paul Jowett Mar 18 '19 at 06:16

While this question was posted some time ago, I thought it important to mention MFA delete protection alongside the other solutions. The OP is trying to solve for the accidental deletion of data. Multi-factor authentication (MFA) manifests in two different scenarios here:

  1. Permanently deleting object versions - Enable MFA delete on the bucket's versioning.

  2. Accidentally deleting the bucket itself - Set up a bucket policy denying delete without MFA authentication.

Couple these with cross-region replication and versioning to reduce the risk of data loss and improve the recovery scenarios.
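
For reference, a rough sketch of those two pieces (bucket name, account ID and MFA device are placeholders; note that MFA Delete can only be enabled by the root account via the CLI/API):

$ aws s3api put-bucket-versioning --bucket my-bucket \
    --mfa "arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456" \
    --versioning-configuration Status=Enabled,MFADelete=Enabled

And a bucket policy statement that denies deletes when MFA is absent:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyDeleteWithoutMFA",
      "Effect": "Deny",
      "Principal": "*",
      "Action": ["s3:DeleteObject", "s3:DeleteBucket"],
      "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/*"],
      "Condition": { "BoolIfExists": { "aws:MultiFactorAuthPresent": "false" } }
    }
  ]
}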

Here is a blog post on this topic with more detail.

user1590603

As this topic was created a long time ago and is still quite relevant, here is some updated news:

External backup

Nothing has changed; you can still use the CLI, or any other tool, to schedule a copy somewhere else (in or out of AWS).

There are tools to do that, and the previous answers were very specific.

"Inside" backup

S3 now supports versioning, which keeps previous versions of your objects. It means that you can create and use a bucket normally and let S3 manage the lifecycle of those previous versions in the same bucket.

An example of a possible config, if you delete a file, would be:

  1. File marked as deleted (still available but "invisible" to normal operations)
  2. File moved to Glacier after 7 days
  3. File removed after 30 days

You first need to activate versioning, then go to the Lifecycle configuration. It's pretty straightforward: target previous versions only, and deletion is what you want. (Screenshot: S3 Lifecycle panel.)

Then, define your policy. You can add as many actions as you want (but each transition costs you). You can't store data in Glacier for less than 30 days. (Screenshot: S3 Lifecycle actions panel.)
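
A rough sketch of that same policy expressed as a lifecycle configuration via the CLI (bucket name is a placeholder; it only affects noncurrent, i.e. previous, versions):

$ cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "previous-versions",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "NoncurrentVersionTransitions": [
        { "NoncurrentDays": 7, "StorageClass": "GLACIER" }
      ],
      "NoncurrentVersionExpiration": { "NoncurrentDays": 30 }
    }
  ]
}
EOF

$ aws s3api put-bucket-lifecycle-configuration --bucket my-bucket \
    --lifecycle-configuration file://lifecycle.json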

Clément Duveau

If you have a lot of data and you already have a bucket, the first sync will take a long time. In my case, I had 400 GB and it took 3 hours the first time. So I think replication is a good solution for S3 bucket backup.

Ankit Kumar Rajpoot
  • I'm about to move about 7TBs into a bucket and am trying to figure out the best option... I'm thinking I need something better than sync. I'm wondering if using a pipeline to copy data to GCS version of glacier might offer the best overall safety? – Brendon Whateley May 01 '20 at 17:06
  • AWS DataSync could be an option here. – Felipe Alvarez Jun 01 '20 at 03:49