
As explained in this article https://medium.com/@mpreziuso/is-gzip-deterministic-26c81bfd0a49, the md5 of two .tar.gz files that are the compression of the exact same set of files can be different. This is because gzip, for example, includes a timestamp in the header of the compressed file.

In the article 3 solutions are proposed, and I would ideally like to use the first one which is:

We can use the -n flag in gzip which will make gzip omit the timestamp and the file name from the file header;

And this solution works well:

tar -c ./bin | gzip -n > one.tar.gz
tar -c ./bin | gzip -n > two.tar.gz
md5sum one.tar.gz two.tar.gz

Nevertheless, I have no idea what a good way to do this in Python would be. Is there a way to do it with tarfile (https://docs.python.org/2/library/tarfile.html)?

Jason Aller
Sophie Jacquin
  • Is there some reason why you can't launch the commands that you just wrote as an external process? `os.system("tar -c ./bin |gzip -n >one.tar.gz")` – Martin Drozdik Jul 11 '17 at 13:38
  • What's wrong with using an explicit `mtime` argument to `gzip.GzipFile()`? – Ignacio Vazquez-Abrams Jul 11 '17 at 14:11
  • To reply to comments... leaving aside that the above example is oversimplified and one might be doing something with `tarfile` that is much less trivially converted to shell commands... `tar` ***IS NOT PORTABLE***. Maybe, like me, the OP is using `tarfile` for portability reasons. And having to manually construct a `GzipFile` vs. using `:gz` is a pain. – Matthew Nov 15 '21 at 19:55
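To illustrate the second comment's suggestion: `tarfile`'s `:gz` mode does not expose the gzip timestamp, but you can wrap an explicit `gzip.GzipFile` with `mtime=0` yourself, which is the in-Python equivalent of `gzip -n`. A minimal sketch (the helper name and sample paths are my own, not from the question):

```python
import gzip
import os
import tarfile

def make_deterministic_targz(src, dest):
    """Write src into dest as a .tar.gz whose gzip header is reproducible."""
    with open(dest, "wb") as raw:
        # filename="" omits the FNAME field and mtime=0 zeroes the timestamp,
        # which is what `gzip -n` does on the command line
        with gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode="w") as tar:
                tar.add(src)

# demo: two runs over the same tree produce byte-identical archives
os.makedirs("bin", exist_ok=True)
with open("bin/example.txt", "w") as f:
    f.write("test file")

make_deterministic_targz("bin", "one.tar.gz")
make_deterministic_targz("bin", "two.tar.gz")
```

Note that the tar members still carry each file's own mtime, so this only makes the archive deterministic for an unchanged source tree; the answers below go further and strip per-member metadata too.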

4 Answers


Martin's answer is correct, but in my case I wanted to ignore the last modified date of each file in the tar as well, so that even if a file was "modified" but with no actual changes, it still has the same hash.

When creating the tar, I can override values I don't care about so they are always the same.

In this example I show that, with a plain tar.bz2, if I re-create my source file with a new modification timestamp, the hash will change (1 and 2 are the same; after re-creation, 4 will differ). However, if I set the time to Unix epoch 0 (or any other arbitrary time), my files will all hash the same (3, 5 and 6).

To do this you need to pass a filter function to tar.add(DIR, filter=tarInfoStripFileAttrs) that removes the desired fields, as in the example below.

import tarfile, time, os

def createTestFile():
    with open(DIR + "/someFile.txt", "w") as file:
        file.write("test file")

# Takes in a TarInfo and returns the modified TarInfo:
# https://docs.python.org/3/library/tarfile.html#tarinfo-objects
# intended to be passed as a filter to tarfile.add
# https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.add
def tarInfoStripFileAttrs(tarInfo):
    # set time to epoch timestamp 0, aka 00:00:00 UTC on 1 January 1970
    # note that when extracting this tarfile, this time will be shown as the modified date
    tarInfo.mtime = 0

    # file permissions, probably don't want to remove this, but for some use cases you could
    # tarInfo.mode = 0

    # user/group info
    tarInfo.uid = 0
    tarInfo.uname = ''
    tarInfo.gid = 0
    tarInfo.gname = ''

    # stripping paxheaders may not be required
    # see https://stackoverflow.com/questions/34688392/paxheaders-in-tarball
    tarInfo.pax_headers = {}

    return tarInfo


# COMPRESSION_TYPE = "gz"  # does not work even with the filter: gzip writes its own timestamp into the stream header
COMPRESSION_TYPE = "bz2"
DIR = "toTar"
if not os.path.exists(DIR):
    os.mkdir(DIR)

createTestFile()

tar1 = tarfile.open("one.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar1.add(DIR)
tar1.close()

tar2 = tarfile.open("two.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar2.add(DIR)
tar2.close()

tar3 = tarfile.open("three.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar3.add(DIR, filter=tarInfoStripFileAttrs)
tar3.close()

# Overwrite the file with the same content, but an updated time
time.sleep(1)
createTestFile()

tar4 = tarfile.open("four.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar4.add(DIR)
tar4.close()


tar5 = tarfile.open("five.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar5.add(DIR, filter=tarInfoStripFileAttrs)
tar5.close()

tar6 = tarfile.open("six.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar6.add(DIR, filter=tarInfoStripFileAttrs)
tar6.close()

$ md5sum one.tar.bz2 two.tar.bz2 three.tar.bz2 four.tar.bz2 five.tar.bz2 six.tar.bz2
0e51c97a8810e45b78baeb1677c3f946  one.tar.bz2      # same as 2
0e51c97a8810e45b78baeb1677c3f946  two.tar.bz2      # same as 1
54a38d35d48d4aa1bd68e12cf7aee511  three.tar.bz2    # same as 5/6
22cf1161897377eefaa5ba89e3fa6acd  four.tar.bz2     # would be same as 1/2, but timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511  five.tar.bz2     # same as 3, even though timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511  six.tar.bz2      # same as 3, even though timestamp has changed

You may want to tweak which params are modified and how in your filter function based on your use case.

Patrick Fay
  • I'm afraid this didn't work for me. I'm using `bz2` compression and hard-coding tarinfo attributes exactly as per the answer. I'm adding several files to the tarfile, not just one, so I'm guessing it might be a sorting issue (i.e. the order in which I'm adding the files). – Sean McCarthy May 27 '21 at 15:22

As a workaround you can use bzip2 compression instead. It does not seem to have this problem:

import tarfile

tar1 = tarfile.open("one.tar.bz2", "w:bz2")
tar1.add("bin")
tar1.close()

tar2 = tarfile.open("two.tar.bz2", "w:bz2")
tar2.add("bin")
tar2.close()

Running md5sum gives:

martin@martin-UX305UA:~/test$ md5sum one.tar.bz2 two.tar.bz2 
e9ec2fd4fbdfae465d43b2f5ecaecd2f  one.tar.bz2
e9ec2fd4fbdfae465d43b2f5ecaecd2f  two.tar.bz2
Martin Drozdik
  • This is almost the work-around I ended up using for the same problem, but I used xz. At the time this answer was written, that might have been less portable, but it's probably safe to use these days, and I believe xz is likely to perform better in both size and speed compared to bz2. – Matthew Nov 15 '21 at 19:49
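For reference, the xz variant mentioned in the comment is a drop-in change to the answer above, since `tarfile` has supported the `"w:xz"` mode since Python 3.3; like bz2, the .xz container stores no timestamp, so repeated runs over the same tree match. A sketch (the sample directory setup is my own):

```python
import os
import tarfile

# sample tree so the example is self-contained
os.makedirs("bin", exist_ok=True)
with open("bin/example.txt", "w") as f:
    f.write("test file")

# same two-archive experiment as above, but with LZMA ("w:xz")
with tarfile.open("one.tar.xz", "w:xz") as tar1:
    tar1.add("bin")

with tarfile.open("two.tar.xz", "w:xz") as tar2:
    tar2.add("bin")
```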

I needed to archive many files in one tar file (not just one), and the above answers didn't work for me. Instead, I used the Linux tar command with Python's subprocess module:

import subprocess
import shlex

def make_tarfile_linux(folder_path, filename):
    """
    Make an idempotent tarfile for an identical checksum each time.
    However, this method does not filter out unwanted files the way Python can...
    """
    tarfile_to_create_path_and_filename = f"/home/user/{filename}"
    tar_command = "tar --sort=name --owner=root:0 --group=root:0 --mtime='UTC 1970-01-01' -cjf"
    command_list = shlex.split(f"{tar_command} {tarfile_to_create_path_and_filename} {folder_path}")
    subprocess.run(command_list)

    return None
Sean McCarthy

Sure, you can eliminate dates and other non-file information in the tar and gzip headers, and use the same version of the same compressor with the same settings, all in order to get exactly the same archive bytes.

However, doing all that leads me to think that you are solving the wrong problem, and that you will run into issues if someone changes the version of the compressor under you, with signatures not matching from before and after the version change.

I would recommend instead that you generate your signatures using the concatenation of the uncompressed file contents. Then your signature will be naturally independent of all of the things you are currently having to go to some lengths to zero out, and will also be independent of changes in the compression code. Then all you will need to do is to take some care to preserve the order of the files in the archive.
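That recommendation could be sketched like this: hash the member names and uncompressed contents in archive order, so the digest is unaffected by compression settings, gzip headers, and per-member timestamps. This is my own illustration of the idea, not code from the answer:

```python
import hashlib
import tarfile

def content_digest(tar_path):
    """SHA-256 over member names and uncompressed contents, in archive order."""
    h = hashlib.sha256()
    with tarfile.open(tar_path, "r:*") as tar:  # "r:*" autodetects compression
        for member in tar:
            h.update(member.name.encode("utf-8"))
            if member.isfile():
                f = tar.extractfile(member)
                # read in chunks to avoid loading large members at once
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
    return h.hexdigest()
```

With this, a .tar.gz and a .tar.bz2 of the same tree produce the same digest, since only the decompressed payload is hashed.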

Mark Adler
  • I'm resisting the urge to downvote this... but it's not useful. I have the exact same problem, and the reason I need consistent checksums is for caching; specifically, for a Dockerfile `ADD` command. A wrong checksum has a *time* cost but not a *correctness* cost... and other than not compressing at all (and eating a bunch more disk space during a build than necessary), I don't have any way to check the uncompressed checksum. – Matthew Nov 15 '21 at 19:43
  • @Matthew Feel free to downvote! I won't be hurt. I need to point this out whenever someone is attempting to compare signatures of compressed files _in order to verify that the contents are the same_ (which I see often on SO). That is an exercise in folly. In your case, you need to know if the compressed, archived files themselves are identical, in which case the signature on the entire file is the right thing for caching. Then you are interested in reducing the probability of a mismatch, which other answers here already cover. – Mark Adler Nov 16 '21 at 01:22
  • Resisting because it's not *actually* wrong, just... not applicable to all cases. Comparing the decompressed streams (using checksums or otherwise) requires control over the comparison process, which at least in my case I don't have. And being paranoid is overkill; *my* goal is to make incremental builds faster. An occasional miss in the very infrequent instance that a compression algorithm changes is not a big deal. (That might be an issue if I was comparing archives made months or years apart, or on different platforms, but those aren't my use case.) – Matthew Nov 16 '21 at 14:48