I need to verify the contents of zip files generated by a Python application. I expect that every run of the app generates exactly the same zip file with the same contents, given the same input. By contents, I mean only the contents of the files being compressed, not the meta-information of these files or of the zip file.

The problem is that the zip files store some meta-information, such as the creation time of each file, which differs on every run of the application. Unfortunately, these zip files may contain millions of small files, which makes it very unpleasant to extract them and calculate a hash value for each small file.

What are good ways to do such a test? I have been trying the md5 function from hashlib, i.e., comparing the md5 value of the zip file to a previously calculated value. However, the md5 value is different on each run of the app because the meta-information is different. Any idea how I can do this test? I don't mind extracting and re-zipping with the same meta-info if possible. Note that the zip files contain multiple levels of directories.

Luke

2 Answers

From what I understand, you are trying to write automated tests to verify the contents of your zip file are what you expect.

md5 seems like a good candidate for that. If you have time-related data in the zip file, I would suggest you use https://github.com/spulec/freezegun for this. It is designed to "suspend" time so that all calls to datetime functions (now(), today()...) return a known value. You could do something like:

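import hashlib

from freezegun import freeze_time

def test_zipping():
    block_size = 2 ** 20  # arbitrary chunk size (1 MiB) for reading the archive
    with freeze_time("2012-01-14 12:34:56"):
        # create_zip_file, data and expected_md5_value come from your own test setup
        zipfile_name = create_zip_file(data)
        md5 = hashlib.md5()
        with open(zipfile_name, "rb") as f:
            # Hash the archive chunk by chunk so it never has to fit in memory
            while True:
                chunk = f.read(block_size)
                if not chunk:
                    break
                md5.update(chunk)
        assert md5.digest() == expected_md5_value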

With this, you should be able to take out the randomness of time related calls from your tests.

(inspired by Get MD5 hash of big files in Python since your zip file seems big enough)
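If the archive is written with Python's zipfile module, another way to pin the timestamps is to set them explicitly on each entry, since ZipFile.write() takes the entry time from the file's mtime on disk and not from the clock. The following is a minimal sketch under the assumption that the app can build its entries itself; build_deterministic_zip and FIXED_DATE are hypothetical names, not part of the original app:

```python
import hashlib
import io
import zipfile

# Hypothetical fixed timestamp; the ZIP format cannot store dates before 1980.
FIXED_DATE = (2000, 1, 1, 0, 0, 0)

def build_deterministic_zip(files):
    """Return zip bytes for a {archive_name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Sort the names so the entry order does not depend on the input order.
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=FIXED_DATE)
            info.compress_type = zipfile.ZIP_DEFLATED
            archive.writestr(info, files[name])
    return buffer.getvalue()

files = {"dir/b.txt": b"beta", "dir/sub/a.txt": b"alpha"}
first = build_deterministic_zip(files)
# Same content, different insertion order.
second = build_deterministic_zip(dict(reversed(list(files.items()))))
print(hashlib.md5(first).hexdigest() == hashlib.md5(second).hexdigest())
```

Sorting the names also removes any dependence on the order in which files are discovered, which matters for nested directories.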

Laurent S
  • Thanks @Laurent S. The contents of the zip file are not time related; it is the creation of the zip file that writes some kind of time stamp into the meta-information of the zip file. But I'll try your suggestion and see if it also freezes the time for the zip module. – Luke Feb 21 '17 at 19:09
  • No, it doesn't work for me. I extract the zip files, then zip them using shutil.make_archive(). The md5 value is still different from run to run. – Luke Feb 21 '17 at 19:41
  • Have you checked that the zip algorithm actually guarantees a stable output for a given input? If not, maybe your testing strategy is not adapted? – Laurent S Feb 21 '17 at 19:45
  • At this point I have no way to check that because I cannot control the meta-information it writes to the file. The files being compressed don't change with the compression and decompression, this is guaranteed. – Luke Feb 21 '17 at 20:29
  • I meant that you should check the zip file format documentation (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT) to verify that what you're trying to do makes sense. Alternatively https://docs.python.org/2/library/zipfile.html#zipfile.ZipFile.testzip might be a good way around the problem? – Laurent S Feb 22 '17 at 03:10
  • My problem is that the zip file contains meta-information about the files being compressed, plus some information in the zip headers. This meta-info is different from run to run. I want to verify that the app generates exactly the same zip file each time it runs. Therefore the testzip method, which checks whether there are bad files, is not enough for me. I updated my question according to our conversation. Thanks a lot for your help. – Luke Feb 22 '17 at 15:16

I like the base idea of Laurent S to make sure you have the very same conditions when you run your tests. As long as you do not consider security an issue I'd agree on using md5.

As you are very unspecific about the meta-data that is different on every run I got curious and made a short test.

zip t1 t00*png 
zip t2 t00*png

Now a metadata change:

touch t00*.png
zip t3 t00*png

Result:

md5sum *.zip
760a4a1c52f3bc6cdd29c1fff7b94c1f  t1.zip
760a4a1c52f3bc6cdd29c1fff7b94c1f  t2.zip
83a8dcb9fe0d50e7b2b8012c8842005e  t3.zip

This implies that at least my version of zip [1] does produce repeatable output as long as no metadata is changed.

Your changes are, by definition, not part of the files' content (e.g. a JPEG's EXIF data is also metadata, but part of the file, while the file access date is not). Otherwise you would have no chance to use any hash function at all.
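That distinction can also be used directly: a hash over content-derived fields only ignores the timestamps entirely. The sketch below is a different technique from hashing the whole zip file. It reads the entry list through Python's zipfile module and hashes each entry's name, uncompressed size and stored CRC-32, so nothing is extracted. content_fingerprint and make_zip are hypothetical helper names, and the approach assumes a 32-bit CRC per file is strong enough for a regression test:

```python
import hashlib
import io
import zipfile

def content_fingerprint(zip_source):
    """Hash entry names, sizes and CRC-32 values, ignoring timestamps."""
    digest = hashlib.md5()
    with zipfile.ZipFile(zip_source) as archive:
        # Sort so the fingerprint does not depend on entry order.
        entries = sorted(
            (info.filename, info.file_size, info.CRC)
            for info in archive.infolist()
        )
    for name, size, crc in entries:
        digest.update(f"{name}\0{size}\0{crc:08x}\n".encode("utf-8"))
    return digest.hexdigest()

def make_zip(date_time, payload):
    # Demo helper only: a one-entry archive with a chosen timestamp.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(zipfile.ZipInfo("a/b.txt", date_time=date_time), payload)
    buffer.seek(0)
    return buffer

old = content_fingerprint(make_zip((2000, 1, 1, 0, 0, 0), b"same"))
new = content_fingerprint(make_zip((2017, 2, 23, 15, 22, 6), b"same"))
changed = content_fingerprint(make_zip((2000, 1, 1, 0, 0, 0), b"different"))
print(old == new, old == changed)
```

The first comparison covers archives that differ only in their timestamps, the second covers a real content change.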

So if you want a comparable result while the files' contents are the same but their metadata (the file system's metadata) isn't, you'd save a huge amount of effort by just tweaking the metadata.

As you are doing some kind of unit test here, you could even use this as a validation: the md5 sum should be identical with tweaked metadata and different without.

Proof of concept:

touch t00*.png -d '2000-01-01T0:00'
zip t1 t00*png
touch t00*.png
zip t2 t00*png
touch t00*.png -d '2000-01-01T0:00'
zip t3 t00*png

Result:

md5sum *.zip
a1e713c1d91a0042b37043c83bb98d1b  t1.zip
3085aa53bee69df4be783636b87ed62c  t2.zip
a1e713c1d91a0042b37043c83bb98d1b  t3.zip
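The touch step can also be done from Python before zipping, including for nested directories. This is a rough sketch, assuming a POSIX-like file system where os.utime also works on directories; normalize_mtimes and FIXED_MTIME are hypothetical names:

```python
import os
import tempfile
from datetime import datetime

# Midnight on 2000-01-01 in local time, like touch -d '2000-01-01T0:00'.
FIXED_MTIME = datetime(2000, 1, 1).timestamp()

def normalize_mtimes(root, timestamp=FIXED_MTIME):
    """Set atime and mtime of every file and directory under root."""
    touched = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in sorted(filenames) + sorted(dirnames):
            path = os.path.join(dirpath, name)
            os.utime(path, (timestamp, timestamp))
            touched.append(path)
    os.utime(root, (timestamp, timestamp))
    return touched

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    with open(os.path.join(root, "a", "b", "file.txt"), "w") as handle:
        handle.write("data")
    normalize_mtimes(root)
    # Collect the mtimes of all directories plus the one file.
    mtimes = {os.path.getmtime(p) for p, _, _ in os.walk(root)}
    mtimes.add(os.path.getmtime(os.path.join(root, "a", "b", "file.txt")))
    all_normalized = mtimes == {FIXED_MTIME}
    print(all_normalized)
```

Normalizing timestamps does not fix the order in which entries are added to the archive, so that remains a separate concern.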

Last but not least, you can try to tweak those areas of your ZIP file that are not relevant for your test. ZIP seems to be a well-behaved container format: the metadata changes from my test show up at regular intervals, which supports my assumption that they sit in per-file headers/footers:

cat t1.zip| xxd -ps -c 20 > t1.hd
cat t2.zip| xxd -ps -c 20 > t2.hd
diff t1.hd t2.hd
1c1
< 504b03041400000008000000212822aad7cacc0b
---
> 504b0304140000000800c37a574a22aad7cacc0b
3c3
< 09000370356d3870356d3875780b000104e80300
---
> 0900030df0ae580df0ae5875780b000104e80300
3432c3432
< 6082504b030414000000080000002128143698a4
---
> 6082504b0304140000000800c37a574a143698a4
3434c3434
< 555409000370356d3870356d3875780b000104e8
---
> 55540900030df0ae580df0ae5875780b000104e8
19691,19693c19691,19693
...

Note the obviously minimal differences caused by the metadata change.
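To check that the differing bytes really are timestamps, the first diff line can be decoded by hand. In the ZIP local file header, the signature, version, flags and compression method are followed by a 16-bit DOS time and a 16-bit DOS date, both little-endian. A small sketch, with decode_dos_datetime as a hypothetical helper name:

```python
import struct
from datetime import datetime

def decode_dos_datetime(header_hex):
    """Decode the modification time from the start of a local file header."""
    header = bytes.fromhex(header_hex)
    signature, _version, _flags, _method, dos_time, dos_date = struct.unpack(
        "<4sHHHHH", header[:14]
    )
    if signature != b"PK\x03\x04":
        raise ValueError("not a local file header")
    return datetime(
        1980 + (dos_date >> 9),    # years since 1980
        (dos_date >> 5) & 0x0F,    # month
        dos_date & 0x1F,           # day
        dos_time >> 11,            # hour
        (dos_time >> 5) & 0x3F,    # minute
        (dos_time & 0x1F) * 2,     # seconds are stored in 2-second steps
    )

# The two variants of the first diff line above.
t1_stamp = decode_dos_datetime("504b03041400000008000000212822aad7cacc0b")
t2_stamp = decode_dos_datetime("504b0304140000000800c37a574a22aad7cacc0b")
print(t1_stamp, t2_stamp)
```

The t1 header decodes to 2000-01-01 00:00:00 and the t2 header to 2017-02-23 15:22:06. The four bytes that follow (22aad7ca, the start of the CRC-32 field) are the same in both lines.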

[1] Linux 4.9.9-1-ARCH #1 SMP PREEMPT Thu Feb 9 19:07:09 CET 2017 x86_64 GNU/Linux,
Zip 3.0 (July 5th 2008), by Info-ZIP, Compiled with gcc 5.3.0 for Unix (Linux ELF) on Jan 12 2016.
RuDevel
  • I confirmed this works for files in a single directory. But I am having trouble making it work for files in multi-level directories. The command I am using is: find TestDir -exec touch -d '2000-01-01T0:00' {} \; I am sure there are no hidden files other than "." and ".." inside the directories. Do you know what I missed here? Thanks a lot. – Luke Feb 23 '17 at 20:28
  • I can confirm some instability using subdirs. First guess would be a random order when files are added. Do you zip using wildcards? Could you try and zip a few files 'manually' meaning have each as separate parameter? – RuDevel Feb 23 '17 at 20:49