
I am trying to download a large number of files (~50 terabytes) into an S3 bucket. The problem is that these files are only accessible through various download links located on a website (they aren't already on my hard drive). I could just download a small portion of the data directly onto my own computer's hard drive, upload it to the S3 bucket, delete it from my hard drive, and repeat with another portion, but I'm worried that doing so would take far too long and use too much bandwidth. Instead, I was hoping I could use an EC2 instance to do the same thing, as the answerer of this question suggested, but I'm having trouble figuring out how I would go about doing this with Java.

With Java, requesting and starting EC2 instances seems pretty clear; however, actually using the instance gets kind of blurry. I understand that you can use the EC2 Management Console to connect to an instance directly, and I could just manually run a script while connected to the instance that would download and upload the files, but I would prefer running a script from my computer that creates the EC2 instance and then uses the instance to accomplish my goal. This is because later on in my project, I will be downloading a file from the same website daily, and using the Windows Task Scheduler on my computer to run a script is cheaper than leaving the EC2 instance running 24/7 and doing it daily there.

Simply put, how do I use Java to use an EC2 instance?

Jack Day
  • If you read the question you linked, you should have noted that nowadays they would suggest using AWS Lambda for this purpose instead of EC2. Also, isn't the website you want to download those files from yours? If it's not, you may have other problems: if they have a good firewall, they might detect that you are being abusive by downloading a lot of files from their server, and temporarily block your IP. – Alisson Reinaldo Silva Jul 24 '17 at 01:01

2 Answers


I'd first point out that downloading/uploading 50TB of data will take a very long time...

Option 1 - do it with Java

What you are doing could be done via the AWS Java SDK. You would need to develop the app which downloads the required files and then uploads these to your S3 bucket using the SDK.
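For the download half, plain Java is enough — something like the sketch below, which streams a link straight to the instance's disk (the class and method names are just illustrative). The upload half would then use the SDK, e.g. `AmazonS3.putObject` or the `TransferManager`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class Downloader {

    // Streams the resource at the given URL to a local file on the
    // instance's disk. From there, the AWS SDK (e.g. AmazonS3.putObject)
    // would push the file to your S3 bucket.
    static void download(String url, Path dest) throws IOException {
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, dest, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```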

I'd recommend against this approach as you're going to be paying twice for the bandwidth going first to the EC2 instance and then your S3 bucket. Plus there are simpler ways...

Option 2 - do it with Lambda

As suggested in the answer you linked, use the AWS Lambda service to upload the remote files to your S3 bucket. You can write it in Java, NodeJS, etc. This will reduce your bandwidth costs and also means you don't need to spin up and deploy to any EC2 instances.

Other points

In terms of having something running from your local machine and processing on a daily basis, I would look at solving this once you've done your initial upload. Trying to solve the two problems together will likely cause you headaches.

Lastly, another option may be the AWS Snowball service. They will ship you one or more physical devices which you fill and send back. May not fit your use-case but worth mentioning.

Word of warning - with 50 TB of data, be mindful of the bandwidth charges you will incur downloading & uploading.

timothyclifford
  • Alright, I'll look into using AWS Lambda instead. I had glanced at it before, but it seemed like EC2 was the way to go. Just to clarify, wouldn't using EC2 remove the bandwidth problem, as the files never get downloaded to my computer? Everything goes through Amazon's network, not mine, right? Also, couldn't launching multiple instances, each downloading and uploading a different portion of the data, save some time? Still new to the cloud-based stuff, just trying to learn. Thanks for the help! – Jack Day Jul 23 '17 at 23:10
  • AWS Data Transfer charges only apply to data being sent from the AWS Cloud to the Internet. The owner of the remote site will pay for any bandwidth for traffic exiting their site (is it on AWS?). In theory, none of your requirements would incur any AWS Data Transfer charges, since you will only have data going *to* AWS or moving *within* AWS. – John Rotenstein Jul 23 '17 at 23:56

There are two distinct phases that your solution would require:

  1. Obtain a list of files to download
  2. Download the files

I would recommend separating these two tasks because an error in the logic for listing the files could stop the download process mid-way, making it difficult to resume once the problem is corrected.

Listing the files is probably best done on your local computer, which would be easy to debug and track progress. The result would be a text file with lots of links. (This is similar in concept to a lot of scraper utilities.)
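For example, a quick-and-dirty way to pull the links out of a page with just the JDK — the regex here is only a rough sketch, and a real HTML parser would be more robust:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkLister {

    // Naive pattern matching href="..." or href='...' attributes.
    private static final Pattern HREF =
        Pattern.compile("href=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Extracts every href value from a page's HTML; write the result
    // out as one link per line in your text file.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }
}
```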

The second portion (downloading the files) could be done on either Amazon EC2 or via AWS Lambda functions.

Using Amazon EC2

This would be a straightforward app that reads your text file, loops through the links and downloads the files. If this is a one-off requirement, I wouldn't invest too much time getting fancy with multi-threading your app. However, this means you won't be taking full advantage of the network bandwidth, and Amazon EC2 is charged per hour.

Therefore, I would recommend using fairly small instance types (each with limited network bandwidth that you can saturate), but running multiple instances in parallel, each with a portion of your text file of links. This way you can divide and conquer.
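Splitting the list is trivial — for example, dealing the links out round-robin, one portion per instance (names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class Partitioner {

    // Deals the links out round-robin into n roughly equal portions.
    // Write each portion to its own text file and hand one to each
    // EC2 instance.
    static List<List<String>> partition(List<String> links, int n) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(new ArrayList<String>());
        }
        for (int i = 0; i < links.size(); i++) {
            parts.get(i % n).add(links.get(i));
        }
        return parts;
    }
}
```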

If something goes wrong mid-way, you can always tweak the code, manually edit the text file to remove the entries already completed, then continue. This is fairly quick-and-dirty, but fine if this is just a one-off requirement.
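Rather than hand-editing the text file, you could also recompute the remaining work, assuming you can list what has already been uploaded (e.g. by listing the keys in your bucket) — again, just an illustrative sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class Resume {

    // Returns the links that still need downloading, given the set of
    // links already completed (e.g. derived from the objects already
    // sitting in your S3 bucket).
    static List<String> remaining(List<String> all, Set<String> done) {
        List<String> left = new ArrayList<>();
        for (String link : all) {
            if (!done.contains(link)) {
                left.add(link);
            }
        }
        return left;
    }
}
```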

Additionally, I would recommend using Amazon EC2 Spot Instances, which can save up to 90% of the cost of Amazon EC2. There is a risk of an instance being terminated if the Spot Price rises, which would cause you some extra work to determine where to resume, so simply bid a price equal to the normal On-Demand price and it will be unlikely (though not guaranteed) that your instances will be terminated.

Using AWS Lambda functions

Each AWS Lambda function can only run for a maximum of 5 minutes and can only store 512 MB of data locally (in /tmp). Fortunately, functions can be run in parallel.

Therefore, to use AWS Lambda, you would need to write a controlling app that calls an AWS Lambda function for each file in your list. If any of the files exceed 512 MB, this would need special handling.

Writing, debugging and monitoring a parallel, distributed application like this probably isn't worth the effort for a one-off task. It would be much harder to debug any problems and recover from errors. (It would, however, be an ideal way to do continuous downloads if you have a continuing business need for this process.)

Bottom line: I would recommend writing and debugging the downloader app on your local computer (with a small list of test files), then using multiple Amazon EC2 Spot Instances running in parallel to download the files and upload them to Amazon S3. Start with one instance and a small list to test the setup, then go parallel with bigger lists. Have fun!

John Rotenstein
  • Okay this is quite helpful! Still, how do I actually use the EC2 instance(s)? I know how to request/start a spot instance, and I can easily use Java to download a file to my own computer, but how do I alter that process so it instead downloads to the EC2 instance? – Jack Day Jul 24 '17 at 00:33
  • An Amazon EC2 instance is just a computer like any other (admittedly, a virtual computer, so it's more like VMware). Just use it like you would normally use a computer! It would be easier to write and debug your code on your own computer, then copy the resulting app to the EC2 instance and run it from there. – John Rotenstein Jul 24 '17 at 00:35
  • Ah, so rather than using my computer to both start an instance AND use that instance to complete a task, I can only use my computer to start the instance, and then the instance itself will be completing the task? I was under the impression that code run from my computer could download a file onto an instance instead of the instance downloading the file itself, if that makes any sense. This cleared up a lot of my confusion, thank you! – Jack Day Jul 24 '17 at 01:11
  • Correct. It is difficult to "push" a file to a computer (eg how would somebody send a file to your computer?), but very easy to run code on the instance to download data from the remote and then upload it to S3. – John Rotenstein Jul 24 '17 at 01:47
  • @JackDay *"I was under the impression that code ran from my computer could download a file onto an instance instead of the instance downloading the file itself"* The entire purpose of the exercise is for code running on the instance to download the file itself. Code running on your computer would have to download the file to your computer, first, then send it to the EC2 instance... which is exactly the part you are trying to avoid, pulling and pushing all those bytes through your local connection. – Michael - sqlbot Jul 24 '17 at 02:14