
I was downloading a file using awscli:

$ aws s3 cp s3://mybucket/myfile myfile

But the download was interrupted (computer went to sleep). How can I continue the download? S3 supports the Range header, but awscli s3 cp doesn't let me specify it.

The file is not publicly accessible so I can't use curl to specify the header manually.

hraban

2 Answers


There is a "hidden" command in the awscli tool which allows lower-level access to S3: s3api.† It is less user-friendly (no s3:// URLs and no progress bar) but it does support the range specifier on get-object:

   --range  (string) Downloads the specified range bytes of an object. For
   more   information   about   the   HTTP    range    header,    go    to
   http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.

Here's how to continue the download:

$ size=$(stat -f%z myfile) # assumes OS X. Change for your OS
$ aws s3api get-object \
            --bucket mybucket \
            --key myfile \
            --range "bytes=$size-" \
            /dev/fd/3 3>>myfile

You can use pv for a rudimentary progress bar:

$ aws s3api get-object \
            --bucket mybucket \
            --key myfile \
            --range "bytes=$size-" \
            /dev/fd/3 3>&1 >&2 | pv >> myfile

(The reason for this file-descriptor rigmarole is that s3api writes its response metadata to stdout at the end of the operation, which would pollute your file. The pv version duplicates the original stdout (the pipe to pv) onto fd 3 and then rebinds stdout to stderr, so the pipe carries only the actual file contents. The version without pv could technically write to stderr (/dev/fd/2 and 2>>), but if an error occurs s3api writes to stderr, and that message would then get appended to your file. So it is safer to use a dedicated descriptor there, as well.)
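
If you want to double-check that the resumed file is complete, you can compare the local size against the object's ContentLength reported by head-object (a minimal sketch, reusing the same placeholder bucket/key and the OS X stat invocation from above):

$ remote_size=$(aws s3api head-object \
                    --bucket mybucket \
                    --key myfile \
                    --query ContentLength \
                    --output text)
$ local_size=$(stat -f%z myfile) # OS X again; use stat -c%s on Linux
$ [ "$local_size" -eq "$remote_size" ] && echo complete \
      || echo "still missing $((remote_size - local_size)) bytes"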

† In git speak, s3 is porcelain, and s3api is plumbing.

hraban
  • Using `/dev/stdout` in this case is wrong, as the command itself (`aws s3api get-object`) writes to `stdout`. One has to sacrifice the convenience of `pv` and write simply: `size=$(stat --printf="%s" myfile); aws s3api get-object --bucket mybucket --key myfile --range "bytes=$size-" myfile.part ; cat myfile.part >> myfile` – random Oct 23 '18 at 10:51
  • If I don't specify outfile, I get: `aws: error: the following arguments are required: outfile`. Version 1.16.30. "Cosmic rays, man" :) – hraban Oct 23 '18 at 20:08
  • Yes, you do need to specify outfile -- it's `myfile.part` above. Apologies, I posted multiline code in one line; can't do it otherwise in a comment, apparently. – random Oct 24 '18 at 21:12
  • Ah, I get your point now! Sorry, my bad. Thanks for that, I'll edit the post :) – hraban Oct 31 '18 at 11:31
  • I've updated the command, slightly adapted to work without a temporary file, because this question is about large files and it lets us keep pv. Thanks for pointing out the bug! – hraban Nov 01 '18 at 01:51
  • @hraban I couldn't get `3>&1 >&2 >> myfile` to work (it printed the file to stdout), but I found that `/dev/fd/3 3>>myfile` worked well under OS X for me (file goes to myfile, stdout goes to screen) – nonagon Jan 14 '19 at 02:47
  • @nonagon thanks for that! you're right, I don't know what I was thinking. That only works with the `pv` in between. I'll update. -- btw, be careful to keep the `>&2` in your solution as well, otherwise you go back to the problem of adding noise to the tail of your file (as was originally flagged by @random) – hraban Jan 14 '19 at 10:36
  • @hraban no problem, I'm happy to help. This is such an important thing to be able to do, and your answer is the only place I was able to find it. Regarding `>&2` is that really necessary? I'm terrible at bash stuff but if it's writing to fd #3 and we're redirecting #3 to a file, does it matter whether stdout and stderr are writing to the screen? I tested my answer as it is (just `/dev/fd/3 3>>myfile`) and it seems to work without corrupting the file. – nonagon Jan 14 '19 at 15:30
  • @nonagon fair point! if you just do it without `pv`, the `>&2` isn't actually necessary. That option has been in want of love because I never use it. But yes; you're definitely right. :) – hraban Jan 14 '19 at 22:03

Use s3cmd; it has a --continue option built in. Example:

# Start a download
> s3cmd get s3://yourbucket/yourfile ./
download: 's3://yourbucket/yourfile' -> './yourfile' [1 of 1]
    123456789 of 987654321     12.5% in 235s   0.5 MB/s

[ctrl-c] interrupt

# Pick up where you left off
> s3cmd --continue get s3://yourbucket/yourfile ./

Note that s3cmd is not multithreaded, whereas awscli is, so awscli is faster. A currently maintained fork of s3cmd, called s4cmd, appears to provide multi-threaded transfers while keeping the usability features of s3cmd:

https://github.com/bloomreach/s4cmd
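
For completeness, a basic s4cmd download looks much like the s3cmd invocation above (just a sketch of s4cmd's get command; I haven't verified whether it offers an equivalent of --continue, so check s4cmd --help before relying on it for resuming):

# Multi-threaded download; syntax mirrors s3cmd's get
> s4cmd get s3://yourbucket/yourfile ./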

David Parks