
When syncing data to an empty directory in S3 using the AWS CLI, it is almost instant. However, when syncing to a large directory (several million folders), it takes a very long time before it even starts to upload/sync the files.

Is there an alternative method? It looks like it is trying to enumerate every file in the S3 directory before syncing. I don't need that check, and uploading the data without it would be fine.

King Dedede
  • That sounds like expected behavior. – Philip Kirkbride Jan 24 '17 at 18:37
  • Syncing 100 MB to a new directory takes almost no time, but syncing to a heavily used directory can take hours - hopefully there is an alternative! – King Dedede Jan 24 '17 at 18:38
  • One alternative that works for me is rclone (https://rclone.org). I didn't do exact benchmarks, but aws cli sync took hours to find the 30 files out of >5000 that had to be synced. rclone did the same in minutes. – mvtango Aug 23 '19 at 07:02
  • @PhilipKirkbride: I don't see why. Unless the OP is using `--delete`, the only files to consider / list are the local ones. – Pierre D Jan 30 '20 at 02:44
  • BTW, I wish `aws s3 [ls|cp|sync]` had options `--min min-key` and `--max max-key`. When we wrote Java equivalents to these commands (many years ago), we made good use of the S3 listing `Marker`. See a Python example of the same idea in https://stackoverflow.com/a/51372405/758174. – Pierre D Jan 30 '20 at 02:48
  • @PierreD just pointing out that it is expected, as confirmed by the accepted answer: all files in the bucket are enumerated. – Philip Kirkbride Jan 30 '20 at 06:09
  • @PhilipKirkbride: what I mean is that, to me, it is *unexpected* given that: 1. this is clearly avoidable and suboptimal, and 2. usually `awscli` is well implemented and fast. In other words, I don't contest the fact that the current implementation of `aws s3 sync` is slow in this case, but I am _surprised_ by it. You make it sound like it is _logical_, which it is not. – Pierre D Jan 31 '20 at 18:47
  • @PierreD yes, good point; hopefully they will update this. – Philip Kirkbride Jan 31 '20 at 23:28
  • If you don't need MD5 checks of every file, you can use the `--size-only` switch per [this answer](https://stackoverflow.com/a/42787035/3281039); see the example below. – user108569 Mar 16 '22 at 16:26
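
Following up on that last comment, this is roughly what the size-only variant looks like (the bucket name is a placeholder). Note that sync still has to list the remote objects, so this mainly helps when the per-file comparison, rather than the listing itself, is the bottleneck:

aws s3 sync . s3://mybucket/ --size-only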

2 Answers


The sync command needs to enumerate all of the objects in the bucket to determine whether each local file already exists in the bucket and whether it is the same as the local file. The more objects you have in the bucket, the longer this will take.

If you don't need this sync behavior, just use a recursive copy command like:

aws s3 cp --recursive . s3://mybucket/

and this should copy all of the local files in the current directory to the bucket in S3.
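
If you want to check what such a recursive copy would transfer before running it for real, the CLI's `--dryrun` flag should list the operations without performing them:

aws s3 cp --recursive . s3://mybucket/ --dryrun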

garnaat
  • Danger! Using `aws s3 cp` could end up being expensive as you'll be uploading your files over and over if you run this copy multiple times. A better solution would likely be to keep using `aws s3 sync` but increase the `max-concurrent-requests` setting: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#max-concurrent-requests – Firefishy Aug 16 '20 at 21:31
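
As a rough sketch of the tuning mentioned in the comment above (assuming the `default` profile; adjust the profile name and value to taste), the concurrency setting can be raised with `aws configure set`:

aws configure set default.s3.max_concurrent_requests 50

The default is 10 concurrent requests, so raising it trades more bandwidth and CPU for faster transfers.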

If you use the unofficial s3cmd from S3 Tools, you can pass the --no-check-md5 option to sync to skip the MD5 checksum comparison and significantly speed up the process.

--no-check-md5        Do not check MD5 sums when comparing files for [sync].
                        Only size will be compared. May significantly speed up
                        transfer but may also miss some changed files.

Source: https://s3tools.org/usage

Example: s3cmd --no-check-md5 sync /directory/to/sync s3://mys3bucket/

spoonsearch