Martin's answer is correct, but in my case I wanted to ignore the last modified date of each file in the tar as well, so that even if a file was "modified" but with no actual changes, it still has the same hash.
When creating the tar, I can override values I don't care about so they are always the same.
The example below shows that with a normal tar.bz2, re-creating my source file (same content, new modification timestamp) changes the hash: archives 1 and 2 are the same, but 4, made after re-creation, differs. However, if I set the time to Unix Epoch 0 (or any other fixed time), my archives all hash the same (3, 5 and 6).
To do this you need to pass a filter function to tar.add(DIR, filter=tarInfoStripFileAttrs) that overrides the desired fields, as in the example below:
import tarfile, time, os
def createTestFile():
    with open(DIR + "/someFile.txt", "w") as file:
        file.write("test file")
# Takes in a TarInfo and returns the modified TarInfo:
# https://docs.python.org/3/library/tarfile.html#tarinfo-objects
# Intended to be passed as a filter to TarFile.add:
# https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.add
def tarInfoStripFileAttrs(tarInfo):
    # set time to epoch timestamp 0, aka 00:00:00 UTC on 1 January 1970
    # note that when extracting this tarfile, this time will be shown as the modified date
    tarInfo.mtime = 0
    # file permissions, probably don't want to remove this, but for some use cases you could
    # tarInfo.mode = 0
    # user/group info
    tarInfo.uid = 0
    tarInfo.uname = ''
    tarInfo.gid = 0
    tarInfo.gname = ''
    # stripping paxheaders may not be required
    # see https://stackoverflow.com/questions/34688392/paxheaders-in-tarball
    tarInfo.pax_headers = {}
    return tarInfo
# COMPRESSION_TYPE = "gz"  # hashes still differ even with the filter: the gzip header carries its own timestamp
COMPRESSION_TYPE = "bz2"
DIR = "toTar"
if not os.path.exists(DIR):
    os.mkdir(DIR)
createTestFile()
tar1 = tarfile.open("one.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar1.add(DIR)
tar1.close()
tar2 = tarfile.open("two.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar2.add(DIR)
tar2.close()
tar3 = tarfile.open("three.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar3.add(DIR, filter=tarInfoStripFileAttrs)
tar3.close()
# Overwrite the file with the same content, but an updated time
time.sleep(1)
createTestFile()
tar4 = tarfile.open("four.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar4.add(DIR)
tar4.close()
tar5 = tarfile.open("five.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar5.add(DIR, filter=tarInfoStripFileAttrs)
tar5.close()
tar6 = tarfile.open("six.tar." + COMPRESSION_TYPE, "w:" + COMPRESSION_TYPE)
tar6.add(DIR, filter=tarInfoStripFileAttrs)
tar6.close()
$ md5sum one.tar.bz2 two.tar.bz2 three.tar.bz2 four.tar.bz2 five.tar.bz2 six.tar.bz2
0e51c97a8810e45b78baeb1677c3f946 one.tar.bz2 # same as 2
0e51c97a8810e45b78baeb1677c3f946 two.tar.bz2 # same as 1
54a38d35d48d4aa1bd68e12cf7aee511 three.tar.bz2 # same as 5/6
22cf1161897377eefaa5ba89e3fa6acd four.tar.bz2 # would be same as 1/2, but timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511 five.tar.bz2 # same as 3, even though timestamp has changed
54a38d35d48d4aa1bd68e12cf7aee511 six.tar.bz2 # same as 3, even though timestamp has changed
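If md5sum isn't available (on Windows, for example), the same comparison can be sketched in Python with hashlib. Note that hashFile is a helper name I made up for this sketch, it is not part of the script above, and the loop assumes the six archive names used in the example.

```python
import hashlib
import os

# hashFile is a hypothetical helper, not part of the original script.
def hashFile(path, algorithm="md5", chunkSize=65536):
    """Return the hex digest of the file at path, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunkSize), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Print one digest per archive, in the same layout as md5sum.
# Archives that have not been created yet are skipped.
for name in ["one", "two", "three", "four", "five", "six"]:
    path = name + ".tar.bz2"
    if os.path.exists(path):
        print(hashFile(path), path)
```

Archives with identical content print identical digests, so 3, 5 and 6 should match each other here as well.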
You may want to tweak which params are modified and how in your filter function based on your use case.
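On the commented-out "gz" option: the gzip format stores a timestamp (and optionally a file name) in its own header, outside the tar data, so a filter on the tar entries cannot reach it. One possible workaround, sketched here under my own assumptions rather than tested against the script above, is to build the gzip stream yourself with gzip.GzipFile(mtime=0) and hand it to tarfile as a plain fileobj. The names stripAttrs and writeDeterministicTarGz are made up for this sketch.

```python
import gzip
import os
import tarfile

# Hypothetical stand-in for tarInfoStripFileAttrs, repeated so the sketch is self-contained.
def stripAttrs(tarInfo):
    tarInfo.mtime = 0
    tarInfo.uid = 0
    tarInfo.gid = 0
    tarInfo.uname = ''
    tarInfo.gname = ''
    tarInfo.pax_headers = {}
    return tarInfo

# Hypothetical helper: writes srcDir into a .tar.gz whose bytes do not
# depend on when the archive or its files were created.
def writeDeterministicTarGz(srcDir, outPath):
    with open(outPath, "wb") as raw:
        # mtime=0 pins the timestamp in the gzip header;
        # filename="" keeps the output file name out of the header.
        with gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz:
            # Plain "w" mode: the compression is handled by the GzipFile above.
            with tarfile.open(fileobj=gz, mode="w") as tar:
                # arcname avoids embedding the full path of srcDir in the archive.
                tar.add(srcDir, arcname=os.path.basename(srcDir), filter=stripAttrs)
```

Calling writeDeterministicTarGz("toTar", "seven.tar.gz") before and after re-creating the file should then give the same hash, though the output can still vary between zlib or Python versions.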