
As part of our project, we have created quite a bushy folder/file tree on S3 with all the files taking up about 6TB of data. We currently have no backup of this data which is bad. We want to do periodic back ups. Seems like Glacier is the way to go.

The question is: what are the ways to keep the total cost of a back up down?

Most of our files are text, so we can compress them and upload whole ZIP archives. This will require processing (on EC2), so I am curious whether there is any rule of thumb for comparing the extra cost of running an EC2 instance for zipping against just uploading uncompressed files.

Also, we would have to pay for data transfer, so I am wondering whether there is any way of backing up other than (i) downloading each file from S3 to an instance and (ii) uploading it, raw or zipped, to Glacier.
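To make the comparison concrete, here is the rough back-of-envelope arithmetic I have in mind. This is only a sketch: every number in it (pricing, compression ratio, zipping throughput) is a placeholder to be replaced with measured figures.

```python
# Rough comparison: one-time EC2 cost to compress vs. ongoing storage savings.
# All numbers below are placeholders -- substitute current pricing and a
# compression ratio measured on a sample of the actual data.

data_tb = 6.0                      # total data to back up, in TB
compression_ratio = 0.3            # assume text compresses to ~30% of original size
glacier_price_per_gb_month = 0.01  # placeholder storage price, USD per GB-month
ec2_price_per_hour = 0.10          # placeholder on-demand instance price, USD/hour
compress_throughput_gb_per_hour = 100.0  # placeholder: how fast the instance can zip

data_gb = data_tb * 1024
ec2_hours = data_gb / compress_throughput_gb_per_hour
one_time_ec2_cost = ec2_hours * ec2_price_per_hour

monthly_cost_raw = data_gb * glacier_price_per_gb_month
monthly_cost_zipped = data_gb * compression_ratio * glacier_price_per_gb_month
monthly_saving = monthly_cost_raw - monthly_cost_zipped

# Months until the one-time EC2 compression cost pays for itself.
print(f"EC2 cost to compress once: ${one_time_ec2_cost:.2f}")
print(f"Monthly storage saving:    ${monthly_saving:.2f}")
print(f"Break-even after {one_time_ec2_cost / monthly_saving:.1f} months")
```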

I Z
  • We finally got tired of dealing with the long latency to restore from Glacier which is typically 3-5 hours, and the hidden cost factors. We ended up creating a program to synchronize and create snapshots of my buckets, amongst other things, using S3 Reduced Redundancy Storage to better approximate the cost savings benefits of Glacier. It has worked well for us for the past few years so we ended up turning it into a commercial product. You can try a full featured 2 week trial version absolutely free at [BucketBacker](https://www.bucketbacker.com/) – Krafty Mar 14 '15 at 14:57

2 Answers


I generally think of Glacier as an alternative storage to S3, not an additional storage. I.e., data would most often be stored either in S3 or Glacier, but rarely both.

If you trust S3's advertised eleven nines of durability, then the reason you're backing up is probably not that you expect S3 itself to lose the data.

You might want to back up the data because, like I do, you see your Amazon account as a single point of failure (e.g., credentials are compromised, or Amazon blocks your account because they believe you are doing something abusive). In that case, however, Glacier is not a sufficient backup, as it still falls under the Amazon umbrella.

I recommend backing up S3 data outside of Amazon if you are concerned about losing the data in S3 due to user error, compromised credentials, and the like.

I recommend using Glacier as a place to archive data for long-term, cheap storage when you know you're not going to need to access it much, if ever. Once objects are transitioned to Glacier, you would then delete them from S3.

Amazon provides automatic archival from S3 to Glacier, which works great, but beware of the extra costs if the average size of your files is small. Here's an article I wrote on that danger:

[Cost of Transitioning S3 Objects to Glacier](http://alestic.com/2012/12/s3-glacier-costs)
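For anyone setting this up programmatically rather than in the console, here is a minimal sketch of such a lifecycle rule using boto3. The bucket name, prefix, and the 30-day threshold are illustrative assumptions, not values from this question.

```python
import boto3

# Minimal sketch: transition objects to the Glacier storage class after 30 days.
# Bucket name, prefix, and threshold are placeholders -- adjust to your layout.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix = apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER"}
                ],
            }
        ]
    },
)
```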

If you still want to copy from S3 to Glacier, here are some points related to your questions:

  • You will presumably leave the data in Glacier a long time, so compressing it is probably worth the short-term CPU usage. The exact trade-off depends on factors like the compressibility of your data, how long it takes to compress, and how often you need to perform the compression.

  • There is no data transfer charge for downloading data from S3 to an EC2 instance in the same region, and no data transfer charge for uploading data into Glacier (per-request charges still apply in both cases).

  • If you upload many small files to Glacier, the per-upload request charges can add up. You can save on cost by combining many small files into a single archive and uploading that instead.
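Here is a rough sketch of that last point using Python's tarfile module and boto3. The paths, date, and vault name are hypothetical, and archives larger than about 4 GB need Glacier's multipart upload API rather than a single upload_archive call.

```python
import tarfile
import boto3

# Sketch: bundle many small files into one compressed archive, then upload
# the single archive to a Glacier vault. Paths and vault name are placeholders.
archive_path = "/tmp/backup-2013-03-06.tar.gz"
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add("/data/to/backup", arcname="backup")  # many small files, one archive

glacier = boto3.client("glacier")
with open(archive_path, "rb") as body:
    response = glacier.upload_archive(
        vaultName="my-backup-vault",
        archiveDescription="Combined small files, 2013-03-06",
        body=body,
    )

# Keep the returned archiveId -- you need it later to retrieve or delete the archive.
print(response["archiveId"])
```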

Another S3 feature that can help protect against accidental loss through user error or attacks is to turn on S3 versioning and enable MFA Delete (multi-factor authentication). Nobody can then permanently delete objects unless they have the credentials plus the physical MFA device in your possession.
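A minimal sketch of enabling both with boto3 follows; the bucket name and the MFA serial/token are placeholders. Note that MFA Delete can only be enabled by the bucket owner's root credentials supplying a valid MFA token.

```python
import boto3

# Sketch: turn on versioning and MFA Delete for a bucket.
# Bucket name and MFA serial/code are placeholders.
s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="my-data-bucket",
    VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
    MFA="arn:aws:iam::123456789012:mfa/root-account-mfa-device 123456",
)
```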

Eric Hammond
  • Eric, thanks for the detailed answer. The main reason why I want to do a backup is that right now any member of our team -- that includes graduate students, professors, professional software developers and other people -- can accidentally delete an entire subtree of data on S3 with one wrong move ("delete folder"). But maybe the answer to that is to do something like what you've described at the bottom of your reply. – I Z Mar 06 '13 at 14:31
  • Just like IZ, I feel that by far the most likely cause of catastrophic loss of data is a mistake on my part. Accidentally deleting a bucket, or running a script that does the same. Having a copy in Glacier provides a safety net in this case. – Micah Apr 26 '13 at 14:17
  • I'm interested in the very same thing and am currently talking to Amazon about options. There is an easy way to move data (archive) from S3 to Glacier but not copy data (backup). If you are dealing with hundreds of TB of data, downloading it all to an ec2 instance then uploading it to Glacier would probably take so long and cost so much that it wouldn't be worth doing. S3 really needs a path for backing up large amounts of data. It's too dangerous not to have it in a production system that customers are paying for. – d512 Aug 07 '13 at 22:38
  • It can also be shockingly expensive to pull a lot of data out of Glacier. The formula they use is fairly convoluted, but pulling out 100 TB of data over the course of 4 hours will cost you $190,000! (you read that right, one hundred ninety thousand dollars). See here http://aws.amazon.com/glacier/faqs/#How_will_I_be_charged_when_retrieving_large_amounts_of_data_from_Amazon_Glacier and here http://liangzan.net/aws-glacier-calculator/ and here http://calculator.s3.amazonaws.com/calc5.html – d512 Aug 08 '13 at 17:38
  • S3 Versioning... Nice! Just what I was looking for. – Rafael Oliveira Jan 30 '14 at 22:02
  • "There is no charge for downloading data from S3 to an EC2 instance." -- you meant other than the $0.004 per 10,000 GET requests, right? – Josh Kupershmidt Jun 27 '14 at 19:33

I initially addressed the same issue -- backing up the S3 buckets I wanted to protect -- by doing the following:

  1. create a second "mirror" bucket for each S3 bucket I want to back up to Glacier
  2. launch a micro Ubuntu server instance for running cron jobs
  3. install s3cmd on the server
  4. write a shell script to sync all objects from each bucket to the mirror bucket (a rough sketch of this step appears after the list)
  5. enable Lifecycle rules on the mirror bucket to transition each object to the Glacier storage class
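For reference, here is a rough boto3 sketch of the sync step (step 4) as an alternative to the s3cmd shell script. The bucket names are placeholders, and unlike s3cmd's sync this naive version recopies objects that have not changed.

```python
import boto3

# Sketch: copy every object from the source bucket into the mirror bucket,
# where a lifecycle rule then transitions the copies to Glacier.
s3 = boto3.client("s3")
source_bucket = "my-data-bucket"          # placeholder
mirror_bucket = "my-data-bucket-mirror"   # placeholder

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=source_bucket):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=mirror_bucket,
            Key=obj["Key"],
            CopySource={"Bucket": source_bucket, "Key": obj["Key"]},
        )
```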

This mirror-and-archive approach works just fine, but I decided for my purposes that it was easier to just enable Versioning on my bucket. This ensures that if an object is accidentally deleted or updated, it can be recovered. The drawback to this approach is that restoring an entire branch or sub-tree might be time-consuming. But it is easier, more cost-effective, and adequate for protecting the contents of the bucket from permanent destruction.

Hope that helps someone down the road.

Todd Price
  • Does versioning protect from accidentally deleting an entire "folder" using the AWS Management Console UI? I tried deleting a folder in a versioned bucket, and I don't see any way to restore it. – Turar May 28 '14 at 22:17