
I'm wondering how to verify a tarball backup against the original directory's checksum after creating it.

Is it possible to do so without extracting the archive, for example if it's a large 20 GB backup?

For example, a directory with two files:

mkdir test &&
echo "one" > test/one.txt &&
echo "two" > test/two.txt

Get checksum of directory:

find test/ -type f -print0 | sort -z | xargs -0 shasum | shasum

Resulting checksum of directory content:

d191c793cacc4bec1f070eb96fa68524cca566f8  -

Create tarball:

tar -czf test.tar.gz test/

The checksum of the directory content stays constant.

But when I create the archive and take a checksum of the archive itself, I noticed that the result varies from run to run. Why is that?

How would I go about getting the checksum of the tarball content to compare to the directory content checksum?

Or what's a better solution to check that the archive contains all the necessary content from the original directory (without extracting it if it's large)?

Not an answer, but possible explanations of why the `tar.gz` checksum is different each time: `tar` might have collected the files in a different order from one time to the next, leading to different content overall (whereas you sorted your filenames before shasum-ing); `tar` includes each file's modification time (https://docs.fileformat.com/compression/tar/), so it's possible a file was touched but not modified; and finally `gzip` includes a timestamp of its own (https://docs.fileformat.com/compression/gz/), which will differ each time. – Erwin Jun 20 '22 at 21:38
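Following up on that comment: if the goal is an archive whose checksum is stable across runs, the varying metadata can be suppressed at creation time. A minimal sketch, assuming GNU tar 1.28+ (for --sort=name) and GNU gzip (whose -n flag omits the embedded name and timestamp):

# Create a reproducible archive: fix the member order, pin every
# member's mtime, drop user-specific ownership, and strip gzip's
# own name/timestamp header.
tar --sort=name --mtime='2000-01-01 00:00:00 UTC' \
    --owner=0 --group=0 --numeric-owner \
    -cf - test/ | gzip -n > test.tar.gz

With those options, creating the archive twice from the same directory should yield byte-identical files, so the archive checksum itself becomes comparable.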

1 Answer


Your directory checksum is calculating the SHA-1 of each file's contents. You would need to read and decompress the entire tar archive to do the same calculation. That doesn't mean you'd need to save the contents of the archive anywhere. You'd just need to read it sequentially into memory, and do the calculation there.
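One way to do that streaming calculation in the shell, as a sketch rather than a definitive recipe: GNU tar's --to-command option pipes each regular member to a command instead of writing it to disk, and exposes the member's path in the TAR_FILENAME environment variable (this is GNU-specific; bsdtar has no equivalent flag). The sketch assumes paths without whitespace or sed metacharacters, and pins the sort locale on both sides so the orderings match:

# Hash every file inside the archive without touching the disk.
# GNU tar runs the --to-command string through the shell; shasum
# prints "hash  -" for stdin, so sed swaps the real path back in.
tar -xzf test.tar.gz \
    --to-command='shasum | sed "s|-$|$TAR_FILENAME|"' \
  | LC_ALL=C sort -k2 | shasum

# The directory side, with the same locale pinned:
find test/ -type f -print0 | LC_ALL=C sort -z | xargs -0 shasum | shasum

Both pipelines emit one "hash  path" line per file, sort those lines by path, and hash the resulting listing, so the two final checksums agree exactly when the archive holds the same paths with the same contents. The archive is decompressed once, sequentially, and nothing is extracted to disk. Alternatively, if a pass/fail comparison against the live directory is enough, GNU tar's -d (--compare) mode reads the archive and reports members whose contents or metadata differ from the filesystem, again without extracting: tar -dzf test.tar.gz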

Mark Adler