2

How do you create a tarball so that its md5 or sha512 hash will be deterministic?

I'm currently creating a tarball of a directory of source code files by running tar --exclude-vcs --create --verbose --dereference --gzip --file mycode.tgz *, and I'd like to record its hash so I can use it as a fingerprint for detecting changes in the future.

However, I've noticed that if I create duplicate tarballs without changing any files, running the Python hashlib.sha512(open('mycode.tgz', 'rb').read()).hexdigest() on each archive returns a different hash.

Is this because tar's compression algorithm is not deterministic? If so, how can I efficiently archive a large collection of files in such a way that I can calculate a consistent hash to detect changes?

Cerin
  • `tar` doesn't have a compression algorithm - it's the `--gzip` option that applies gzip compression to the tarball. Would be interesting to know whether the problem persists without `--gzip` (and switching then to a plain `.tar` extension). Then at least you'd find out whether the differences are coming from `tar` or from `gzip`. – Tim Peters Dec 06 '13 at 01:26
  • As I said below, tar includes a modification date in its header, so it's likely at least caused by that. – Jesse Rusak Dec 06 '13 at 01:28

4 Answers

1

It's probably possible to generate a version of tar that produces deterministic hashes, but rather than doing that, most packaging systems that need consistent tar hashes use something like pristine-tar. Unfortunately, pristine-tar will not help for your use case, because it regenerates an existing tarball byte for byte rather than creating new archives deterministically.

However, the Git version control system is quite good at generating consistent hashes (SHA-1, not SHA-512) of a directory tree.

git add .
git write-tree

will print a tree hash that stays the same until something tracked changes. File contents, file names, and mode changes are tracked; timestamps are not part of the hash.
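If Git is not available, the same idea can be approximated in plain Python. The sketch below is not Git's algorithm, only a stand-alone analogue under stated assumptions: the names `file_digest` and `tree_digest` are hypothetical, the tree is assumed to contain only regular files and directories, and VCS directories such as `.git` are not excluded.

```python
import hashlib
import os
import stat


def file_digest(path):
    """SHA-512 of one file's bytes, read in chunks to bound memory use."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()


def tree_digest(root):
    """Hash the relative path, permission bits, and contents of every file under root.

    Timestamps and ownership are deliberately left out, so touching a file
    without changing it does not change the result. Empty directories are
    ignored (hypothetical helper, not part of any library).
    """
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            entries.append((os.path.relpath(full, root), full))

    outer = hashlib.sha512()
    for rel, full in sorted(entries):  # sorted, so os.walk order does not matter
        mode = stat.S_IMODE(os.stat(full).st_mode)
        outer.update(os.fsencode(rel) + b"\0")
        outer.update(("%o" % mode).encode("ascii") + b"\0")
        outer.update(file_digest(full))  # fixed-length digest keeps framing unambiguous
    return outer.hexdigest()
```

Calling `tree_digest("path/to/source")` before and after a change gives two fingerprints to compare, without creating an archive at all.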

Sam Hartman
  • +1, though I believe tar files include a modification date in their headers (at least according to Wikipedia), so it seems like it would be awkward to generate a consistent tar file. – Jesse Rusak Dec 06 '13 at 01:22
  • @JesseRusak Yeah, I was thinking of a custom tar program that among other things chose most recent file time as header mod time. Possible but way too difficult for practicality – Sam Hartman Dec 06 '13 at 01:43
1

After finding this question, I realized that my archives are actually nearly identical, except for a timestamp stored in the first few bytes (the gzip header). Changing my code to hashlib.sha512(open(fn, 'rb').read()[8:]).hexdigest(), which skips the first 8 bytes, fixed the problem.
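A small self-contained sketch of why skipping 8 bytes works: in the gzip format, bytes 0-3 hold the magic number, compression method, and flags, and bytes 4-7 hold the MTIME field. The helper names `make_gzip` and `gzip_hash_ignoring_mtime` are hypothetical, and the example builds its gzip streams in memory rather than reading a real `mycode.tgz`.

```python
import gzip
import hashlib
import io

GZIP_MTIME_END = 8  # bytes 0-3: magic, method, flags; bytes 4-7: MTIME


def gzip_hash_ignoring_mtime(data):
    """SHA-512 of a gzip stream with its first 8 header bytes skipped."""
    return hashlib.sha512(data[GZIP_MTIME_END:]).hexdigest()


def make_gzip(payload, mtime):
    """Compress payload in memory, stamping the given mtime into the header."""
    buf = io.BytesIO()
    # filename="" keeps the optional file name out of the header.
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(payload)
    return buf.getvalue()


first = make_gzip(b"same payload", mtime=1386293160)
second = make_gzip(b"same payload", mtime=1386293999)

# The raw streams differ only in the MTIME field ...
print(first == second)
# ... so the hashes agree once the first 8 bytes are skipped.
print(gzip_hash_ignoring_mtime(first) == gzip_hash_ignoring_mtime(second))
```

This only covers the gzip header. The tar stream inside also records each file's modification time, so the hashes match only as long as the files themselves have not been touched between runs.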

Cerin
1

GNU tar can fix the entry order, ownership, and timestamps so that the archive hashes consistently (the --sort option requires GNU tar 1.28 or later).

tar --sort=name --owner=root:0 --group=root:0 --mtime='UTC 2019-01-01' ...

Credits: https://stackoverflow.com/a/54908072
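For cases where a recent GNU tar is not available, roughly the same normalization can be sketched with Python's standard `tarfile` and `gzip` modules. This is an illustration under assumptions, not an equivalent of the command above: the function name `deterministic_tgz` is hypothetical, timestamps are pinned to 0 rather than to a chosen date, and the output file is assumed to live outside the source directory.

```python
import gzip
import os
import tarfile


def _normalize(info):
    """Clear the metadata that varies between runs or machines."""
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def deterministic_tgz(src_dir, out_path):
    """Write src_dir to out_path so the bytes depend only on names, modes, and contents."""
    entries = []
    for root, dirs, files in os.walk(src_dir):
        for name in dirs + files:
            full = os.path.join(root, name)
            entries.append((os.path.relpath(full, src_dir), full))
    entries.sort()  # fixed member order, like --sort=name

    with open(out_path, "wb") as raw:
        # filename="" and mtime=0 keep the output name and the current
        # time out of the gzip header.
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode="w", format=tarfile.GNU_FORMAT) as tar:
                for arcname, full in entries:
                    # recursive=False: every path is already listed explicitly.
                    tar.add(full, arcname=arcname, recursive=False, filter=_normalize)
```

Running `deterministic_tgz("src", "mycode.tgz")` twice on an unchanged tree should then give byte-identical files, so hashing the whole archive works without skipping any header bytes.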

Bryan Larsen
0

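The shell glob might be shuffling the order of the files as they're added to the archive. Maybe try specifying the exact order with something like the following (--no-recursion keeps tar from adding each directory's contents a second time, since find already lists every path):

find . | LC_ALL=C sort | tar --no-recursion --exclude-vcs --create --verbose --dereference --gzip --file mycode.tgz -T -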
user2926055
  • Shell globbing doesn't shuffle the filenames; it expands to a sorted list, although the order depends on the active locale. – user2719058 Dec 05 '13 at 23:58
  • @user2719058 Good point. Perhaps tar, seeing only the top-level directories, is picking lower-level files/directories in a non-deterministic order. Or maybe I'm completely off the mark. – user2926055 Dec 06 '13 at 00:04