I am trying to pull a very large file (>1 TB) from the web into AWS S3. Normally I'd stream it with Requests into a multipart upload (roughly the sketch shown after the command below), but at this size that approach is extremely slow. While looking for an alternative, I found that the command discussed here is fairly fast and not too resource-hungry:
curl "https://download-link-address/" | aws s3 cp - s3://aws-bucket/data-file
Even the curl pipeline is still pretty slow, though: run from an EC2 instance, it would take several weeks to finish. I'm very new to EMR and, to be honest, still trying to wrap my head around how it works, but using a cluster to parallelize the transfer seems like a natural idea. My plan would be to pass a Range header in each request so that it fetches only one chunk of the file, and then somehow recombine the chunks in S3 (or feed them into a multipart upload, if the CLI supports that).
But I don't know how to set up a cluster to do this, in particular how to automatically hand each worker the next byte range when it's time for a new chunk to be pulled (I've put a single-machine sketch of the range bookkeeping I have in mind at the end of this question). So my question is: is there a relatively simple way to do this? Or, alternatively, is this even the right approach?
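To make the range bookkeeping concrete, here is a single-machine sketch of the idea using Requests and boto3. It assumes the server honors Range requests and reports Content-Length on a HEAD request; the part size and worker count are arbitrary numbers I picked, not anything tuned:

import concurrent.futures

import boto3
import requests

url = "https://download-link-address/"   # placeholder
bucket = "aws-bucket"                    # placeholder
key = "data-file"                        # placeholder
part_size = 512 * 1024 * 1024            # 512 MiB; S3 allows at most 10,000 parts

s3 = boto3.client("s3")

# Get the total size up front so every byte range can be derived automatically.
total = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
ranges = [(n + 1, start, min(start + part_size, total) - 1)
          for n, start in enumerate(range(0, total, part_size))]

upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

def transfer_part(part):
    """Fetch one byte range and upload it as one part of the multipart upload."""
    part_number, start, end = part
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    resp.raise_for_status()
    result = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                            PartNumber=part_number, Body=resp.content)
    return {"PartNumber": part_number, "ETag": result["ETag"]}

try:
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        parts = list(pool.map(transfer_part, ranges))  # keeps parts in order
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                 MultipartUpload={"Parts": parts})
except Exception:
    # Abort so the orphaned parts don't keep accruing storage charges.
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
    raise

One obvious catch is that each in-flight part is held in memory, so part_size times max_workers has to fit in RAM; another is that this only parallelizes across a single machine's bandwidth, which is why a cluster seemed appealing in the first place.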