
We are using a CI/CD pipeline to sync the files stored in a Git repo to an S3 bucket. aws s3 sync determines what files should be copied based on their file sizes and last modified timestamps.

However, every time the Git repo is checked out in the pipeline, each file gets a new timestamp. This causes aws s3 sync to copy files to the S3 bucket again even though they are probably unchanged.

An alternative is to run aws s3 sync with the --size-only option. The problem is that this will not sync modified files that still have the same size as before.

Is it possible to use aws s3 sync, or any alternative, to sync only the files whose content has changed?

Ray Jasson
    I don't think you can do that with this utility, though I think in most situations the --size-only option is sufficient. Most changes will be at least a few bytes different. But not all, you're right. I think you'll either have to suffer the cost of a full transfer, or write another process to check with ETags or a local hash table. – theherk Feb 24 '23 at 10:54
    I would not use size only, that's going to introduce problems at some point. Personally, I think you may be better off with the over-inclusive sync unless it's a very large amount of data or causes caching issues for you later. – jarmod Feb 24 '23 at 12:35

1 Answer


aws s3 sync doesn't have this option; it only compares file sizes and last-modified timestamps. You could do:

For the last point, you could use the fact that S3 stores a hash of each uploaded file as metadata (the ETag, which is the MD5 digest for plain single-part uploads) and compare it with the local file to decide whether the file needs to be uploaded. You could also build a very "easy" custom solution like:

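  • Upload all files to S3 (initial load)
  • Tag all the files with the commit hash
  • On later runs, delete from S3 the files that have changed since that revision
  • Run aws s3 sync (with --size-only, otherwise the fresh timestamps trigger a full upload again).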
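
The steps above could be sketched roughly as follows. This is not a ready-made tool, just a minimal Python sketch under a few assumptions: the commit hash of the last deployment is known (for example, read back from the tag), the pipeline checkout is deep enough to contain that commit, and file names contain no tabs or newlines. The function names git_name_status and plan_sync are made up for illustration. Instead of deleting and re-syncing, this variant asks git which paths changed and splits them into uploads and deletions:

```python
import subprocess


def git_name_status(repo_dir: str, deployed_commit: str) -> str:
    """Return `git diff --name-status` output between the deployed commit and HEAD.

    Assumes git is installed and the checkout contains deployed_commit.
    --no-renames reports a rename as one deletion plus one addition.
    """
    result = subprocess.run(
        ["git", "diff", "--name-status", "--no-renames", deployed_commit, "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def plan_sync(name_status_output: str) -> tuple[list[str], list[str]]:
    """Split name-status lines into (paths to upload, paths to delete)."""
    uploads, deletes = [], []
    for line in name_status_output.splitlines():
        if not line.strip():
            continue
        # Each line looks like "<status>\t<path>", e.g. "M\tindex.html".
        status, path = line.split("\t", 1)
        if status.startswith("D"):
            deletes.append(path)
        else:
            # Added, modified, and type-changed files all need an upload.
            uploads.append(path)
    return uploads, deletes
```

The upload list can then be fed to aws s3 cp one file at a time, and the delete list to aws s3 rm, before tagging the objects with the new commit hash.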

This would involve some programming, and I do not know of any tool that helps you here, because:

In the end, build times also matter. Calculating hashes (the only remaining option when you don't want to upload everything and can't rely on sizes and timestamps) takes time too, so it might be easier to just upload everything. That depends on the size, though; things change if we are talking about GBs.
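
If you do go the hashing route, the comparison against the hashes S3 already stores could look roughly like this. It is a minimal sketch, not a drop-in replacement for aws s3 sync, and it rests on assumptions: the objects were uploaded in a single part without SSE-KMS, so their ETag equals the MD5 of the content (multipart uploads get a different kind of ETag and would always be flagged as changed), and object keys are the file paths relative to the local root. The function names are hypothetical; boto3 is only needed for fetch_remote_etags.

```python
import hashlib
from pathlib import Path


def local_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_remote_etags(bucket: str) -> dict[str, str]:
    """Map every object key in the bucket to its ETag, quotes stripped."""
    import boto3  # imported lazily so the other helpers work without boto3

    etags = {}
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            etags[obj["Key"]] = obj["ETag"].strip('"')
    return etags


def files_to_upload(root: Path, remote_etags: dict[str, str]) -> list[str]:
    """Return keys (POSIX paths relative to root) that are new or differ in MD5."""
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        key = path.relative_to(root).as_posix()
        if remote_etags.get(key) != local_md5(path):
            changed.append(key)
    return changed
```

Calling files_to_upload(Path("site"), fetch_remote_etags("my-bucket")) (both names are placeholders) gives the list of keys to upload. Point root at the directory you actually publish rather than the repo root, otherwise the .git directory is scanned too.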

Much more efficient syncing tools exist, but they do not work with S3 (rsync, cdc-file-transfer, among others). So if you are dealing with GBs and transfer speed matters, you might need to ditch S3 and move to EFS.

Augunrik