32

I have run into a strange problem with git and zip files. My build script takes a bunch of documentation HTML pages and zips them into a docs.zip, which I then check into git.

The problem is that every time I re-run the build script, the new zip file has a different SHA1 than the one from the previous run. My build script calls the Ant zip task. However, manually calling zip from the Mac OS X shell also gives me a different SHA1 if I zip up the same directory twice.

Run 1:

zip foo.zip *
openssl sha1 foo.zip 
rm foo.zip 

Run 2:

zip foo.zip *
openssl sha1 foo.zip

Run 1 and run 2 give different SHA1s even though the content did not change between runs. In both cases zip prints exactly the same list of files being zipped, and it does not indicate that any OS-specific files like .DS_Store are being included in the zip file.

Is the zip algorithm deterministic? If run on the same content, will it produce exactly the same bits? If not, why not?

What are my choices for zipping the files in a deterministic way? There are thousands of them in the zipped-up file, and I don't expect those files to change much. I know that git compresses any files you check in, but the motivation for zipping them is just to keep the mass of them out of the way.

ams
  • 2
    Two things. First it seems that the zip file itself might be included in the zip since it's in the same directory, which might give non-deterministic results. Second the zip might include dates and times which will be different from run to run. – Mark Ransom Mar 15 '12 at 04:52
  • 2
    The zip file is not being included in the newly generated zip; I checked that before I posted my question. – ams Mar 15 '12 at 04:58

4 Answers

21

According to Wikipedia (http://en.wikipedia.org/wiki/Zip_(file_format)), zip files have header fields for each file's last modification time and last modification date, so a zip that is rebuilt from the same content will still look changed to git whenever those timestamps differ. It also seems that there is no flag to tell zip not to set those headers.

I am resorting to just using tar; it seems to produce the same bytes for the same input when run multiple times.
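
One caveat on tar: a tar archive also records each entry's modification time, owner and group, and it writes entries in whatever order the directory listing returns, so the bytes only repeat while that metadata stays the same. A minimal sketch that pins those fields, assuming GNU tar 1.28 or later (the `--sort` option is GNU-specific, and the bsdtar shipped with macOS may not accept these flags). The function name `deterministic_tar` is made up for illustration:

```shell
#!/bin/sh
# Sketch only: assumes GNU tar >= 1.28. deterministic_tar is a hypothetical helper name.
# Usage: deterministic_tar <source-dir> <output.tar>
deterministic_tar() {
    src_dir=$1
    out_file=$2
    # --sort=name     : store entries in name order, not directory-listing order
    # --mtime=@0      : record the Unix epoch as every entry's modification time
    # --owner/--group : record uid/gid 0, numerically, instead of the current user
    tar --sort=name \
        --mtime=@0 \
        --owner=0 --group=0 --numeric-owner \
        -cf "$out_file" -C "$src_dir" .
}
```

With this, something like `deterministic_tar docs docs.tar; openssl sha1 docs.tar` (paths are placeholders) should print the same digest across rebuilds as long as the file names, contents and permissions are unchanged.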

Xavier Nodet
ams
  • That's right, a ZIP archive includes various file information, including file modification time (and on Unix, file permissions, owner, creation time and even access time). – Nickolay Olshevsky Apr 23 '13 at 12:22
17

By default, gzip saves file name and time stamp

%> gzip -help 2>&1 | grep -e '-n'
 -N --name            save or restore original file name and time stamp
 -n --no-name         don't save original file name or time stamp

%> gzip -V
Apple gzip 272

Using -n option:

%> tar cv foo/ | gzip -n > foo.tgz; shasum foo.tgz # sha256sum on Ubuntu

With -n you will consistently get the same hash.

Try the above without -n and you should see a different hash each time.
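
To see what `-n` changes, here is a small sketch that compresses the same file under two different modification times and compares the resulting bytes. It assumes a gzip that accepts `-n` and `-c` (GNU and Apple gzip both do). `same_gzip_bytes` is a made-up helper name, and note that it resets the file's mtime with `touch`:

```shell
#!/bin/sh
# Sketch only: same_gzip_bytes is a hypothetical helper, not part of gzip.
# Returns 0 if compressing <file> under two different mtimes gives identical bytes.
# Usage: same_gzip_bytes <file> [gzip options...]
# Side effect: the file's modification time is overwritten.
same_gzip_bytes() {
    src=$1
    shift
    out=$(mktemp -d) || return 2
    # First run: pretend the file was last modified in 2010.
    touch -t 201001010000 "$src"
    gzip "$@" -c "$src" > "$out/first.gz"
    # Second run: same content, but a 2020 modification time.
    touch -t 202001010000 "$src"
    gzip "$@" -c "$src" > "$out/second.gz"
    # Compare the two outputs byte for byte.
    cmp -s "$out/first.gz" "$out/second.gz"
    status=$?
    rm -rf "$out"
    return $status
}
```

`same_gzip_bytes notes.txt -n` should succeed, while `same_gzip_bytes notes.txt` is expected to fail on gzip builds that store the mtime of a named input file by default (`notes.txt` is just a placeholder).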

chicken_rancher
  • 7
    This is a correct answer, but it would be helpful if you tell the user what it does, and how it solves the problem. From the gzip help "-n --no-name When compressing, do not save the original file name and time stamp by default... " The saved original filenames were affecting the hash. – RED MONKEY Jul 06 '16 at 23:08
7

I had success creating files with the same SHA1 using the -X (--no-extra) flag for zip.

I created a folder and a couple of files to zip to test it and, as expected, got different SHA1 hashes every time:

$ mkdir stuff
$ echo "Stuff 1" > stuff/stuff1.txt
$ echo "Stuff 2" > stuff/stuff2.txt
$ zip -r stuff.zip stuff/
  adding: stuff/ (stored 0%)
  adding: stuff/stuff1.txt (stored 0%)
  adding: stuff/stuff2.txt (stored 0%)

$ shasum stuff.zip
1c8be43ac859bb57603be1243da14022710d22bd  stuff.zip

$ zip -r stuff.zip stuff/
updating: stuff/ (stored 0%)
updating: stuff/stuff1.txt (stored 0%)
updating: stuff/stuff2.txt (stored 0%)

$ shasum stuff.zip
73920362d0f7de74d87286502e03e7126fdc0a6a  stuff.zip

However, using -X gets me the same hash after consecutive zipping:

$ zip -r -X stuff.zip stuff/
updating: stuff/ (stored 0%)
updating: stuff/stuff1.txt (stored 0%)
updating: stuff/stuff2.txt (stored 0%)

$ shasum stuff.zip
1ed228b16d1ee803f26a8b1419f2eb3bf7fcb9f5  stuff.zip

$ zip -r -X stuff.zip stuff/
updating: stuff/ (stored 0%)
updating: stuff/stuff1.txt (stored 0%)
updating: stuff/stuff2.txt (stored 0%)

$ shasum stuff.zip
1ed228b16d1ee803f26a8b1419f2eb3bf7fcb9f5  stuff.zip

I don't have the time to dig in and find out which extra info causes the difference to show up in the first case, but maybe this will be helpful to someone trying to solve it. Also, I only tested this on macOS 10.12.6.

robertspierre
Kao Félix
  • 5
    The `-X` flag may work for consecutive zipping of the same file, but won't work for two files with the same name and content, or even the same file with a modified access time. – gsf Jun 05 '18 at 19:20
0

Use the script below to create deterministic zip or jar files:

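#!/bin/bash
# Rebuilds a zip/jar with fixed timestamps and no extra fields, writing <file>.new.

usage() {
    echo "Usage : ./createDeterministicArtifact.sh <zip/jar file name>"
    exit 1
}

info() {
    echo "$1"
}

strip_artifact() {
    if [ -z "${file}" ]; then
        usage
    fi
    if [ -f "${file}" ] && [ -s "${file}" ]; then
        # Resolve the output path up front, because zip runs from inside the temp directory.
        out="$(cd "$(dirname "${file}")" && pwd)/$(basename "${file}").new"
        mkdir -p "${file}.tmp"
        unzip -oq -d "${file}.tmp" "${file}"
        # Give every extracted entry the same access and modification time.
        find "${file}.tmp" -follow -exec touch -a -m -t 201912010000.00 {} \+
        if [ "$UNAME" == "Linux" ] ; then
            # Clear the append-only attribute on the extracted entries.
            find "${file}.tmp" -follow -exec chattr -a {} \+
        elif [[ "$UNAME" == CYGWIN* || "$UNAME" == MINGW* ]] ; then
            # Clear the Windows archive attribute on the extracted entries.
            find "${file}.tmp" -follow -exec attrib -A {} \+
        fi
        # Remove any stale output so zip creates a new archive instead of updating it.
        rm -f "${out}"
        # -D: no directory entries, -X: no extra file attributes.
        # The subshell keeps the caller's working directory unchanged.
        (cd "${file}.tmp" && zip -rq -D -X -9 -A --compression-method deflate "${out}" .)
        rm -rf "${file}.tmp"
        info "Recreated deterministic artifact: ${file}.new"
    else
        info "Input file is missing or empty. Please validate the file and try again"
    fi
}

# UNAME was read but never set in the original script.
UNAME=$(uname)
file=$1
strip_artifact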
Shubham