I am trying to pull a very large file (>1 TB) from the web into AWS S3. Normally I'd stream it with Requests into a multipart upload (roughly the sketch shown after the command below), but at this size that approach is extremely slow. While looking for an alternative, I found that the command discussed here is fairly fast and not too resource-hungry:
curl "https://download-link-address/" | aws s3 cp - s3://aws-bucket/data-file
Even the curl pipeline is still pretty slow, though: run from an EC2 instance, it would take several weeks to finish. I'm very new to EMR and, to be honest, still trying to wrap my head around how it works, but using a cluster to parallelize the transfer seems like a natural idea. My plan would be to pass a Range header in each request so that it fetches only one chunk of the file, and then somehow recombine the chunks in S3 (or feed them into a multipart upload, if the CLI supports that).
But I don't know how to set up a cluster to do this, in particular how to automatically hand each worker the next byte range when it's time for a new chunk to be pulled (I've put a single-machine sketch of the range bookkeeping I have in mind at the end of this question). So my question is: is there a relatively simple way to do this? Or, alternatively, is this even the right approach?
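To make the range bookkeeping concrete, here is a single-machine sketch of the idea using Requests and boto3. It assumes the server honors Range requests and reports Content-Length on a HEAD request; the part size and worker count are arbitrary numbers I picked, not anything tuned:

import concurrent.futures

import boto3
import requests

url = "https://download-link-address/"   # placeholder
bucket = "aws-bucket"                    # placeholder
key = "data-file"                        # placeholder
part_size = 512 * 1024 * 1024            # 512 MiB; S3 allows at most 10,000 parts

s3 = boto3.client("s3")

# Get the total size up front so every byte range can be derived automatically.
total = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
ranges = [(n + 1, start, min(start + part_size, total) - 1)
          for n, start in enumerate(range(0, total, part_size))]

upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

def transfer_part(part):
    """Fetch one byte range and upload it as one part of the multipart upload."""
    part_number, start, end = part
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
    resp.raise_for_status()
    result = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                            PartNumber=part_number, Body=resp.content)
    return {"PartNumber": part_number, "ETag": result["ETag"]}

try:
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        parts = list(pool.map(transfer_part, ranges))  # keeps parts in order
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                 MultipartUpload={"Parts": parts})
except Exception:
    # Abort so the orphaned parts don't keep accruing storage charges.
    s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
    raise

One obvious catch is that each in-flight part is held in memory, so part_size times max_workers has to fit in RAM; another is that this only parallelizes across a single machine's bandwidth, which is why a cluster seemed appealing in the first place.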