Why are my java.util.zip functions showing inconsistent behavior?

Question

I have a Java application that uses the java.util.zip library to compress and decompress files. There is a zip file on the server (created by my application), and the client zips some of their files and uploads the archive to the server. If the underlying files have not changed, I don't want to waste time uploading, so I figured I could compute the MD5 hashes of the client-side and server-side archives and check whether they match. However, when I use my application to decompress a zip file and then re-compress it without changing any of the underlying files, the old and new zip files have different MD5 hashes. Does anybody know why this is happening, and whether there's a better way to compare two zip files? Thanks.

score 3 · Answer 1 · answered Dec 12 '11 at 17:13

It's even worse, I think:

Doing the same zip-operation twice can result in two different zip-archives:

> zip some.zip some.txt 
  adding: some.txt (stored 0%)
> zip other.zip some.txt
  adding: some.txt (stored 0%)
> ll
total 24
-rw-r--r--  1 cthies  staff  170 12 Dez 18:01 other.zip
-rw-r--r--  1 cthies  staff    4 12 Dez 18:01 some.txt
-rw-r--r--  1 cthies  staff  170 12 Dez 18:01 some.zip
> md5 *.zip
MD5 (other.zip) = f56d7753c5af78427274d930b9fb8c90
MD5 (some.zip) = e2f0382c4ad31871f62fb559157df8e8

Looking at the binaries, one can see a difference in just one place:

> xxd some.zip > some.xxd
> xxd other.zip > other.xxd
> colordiff *.xxd
3c3
< 0000020: 6d65 2e74 7874 5554 0900 0363 33e6 4e78  me.txtUT...c3.Nx
---
> 0000020: 6d65 2e74 7874 5554 0900 0363 33e6 4e64  me.txtUT...c3.Nd

The differing byte comes right after the "UT" marker, which Info-ZIP's zip uses for its extended timestamp field, so I think (depending on the zip application itself) a time value such as the current system time can be involved. Thus any zip operation, even on exactly the same sources, can(!) produce a unique archive, and the checksums can't be assumed to be equal.

Time-independent tools I found: tar and 7z (both command-line). That is, tar and 7z reproduce archives with equal checksums (MD5).

(tested on OSX 10.6.8 with command-line zip utility)
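
If the archive is written with java.util.zip, as in the question, one way to take the time out of the picture is to give every entry an explicit timestamp and to write the entries in a fixed order. The following is only a minimal sketch under assumptions: the class name DeterministicZip, the zip helper, and the FIXED_TIME_MILLIS constant are made up for illustration, and the content is kept in memory. ZipEntry.setTime converts the value using the JVM's default time zone, so byte-identical output across machines would presumably also need the same time zone and the same compression implementation on both sides.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class DeterministicZip {
    // Hypothetical fixed timestamp (2000-01-01T00:00:00Z) applied to every entry.
    static final long FIXED_TIME_MILLIS = 946684800000L;

    /** Zips a name -> content map, writing entries in sorted name order. */
    static byte[] zip(Map<String, byte[]> files, long entryTimeMillis) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            // TreeMap iteration gives a stable, sorted entry order.
            for (Map.Entry<String, byte[]> file : new TreeMap<>(files).entrySet()) {
                ZipEntry entry = new ZipEntry(file.getKey());
                // Without an explicit time, ZipOutputStream stamps the entry
                // with the current system time.
                entry.setTime(entryTimeMillis);
                out.putNextEntry(entry);
                out.write(file.getValue());
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new TreeMap<>();
        files.put("some.txt", "abc\n".getBytes(StandardCharsets.UTF_8));
        byte[] first = zip(files, FIXED_TIME_MILLIS);
        byte[] second = zip(files, FIXED_TIME_MILLIS);
        System.out.println("identical bytes: " + Arrays.equals(first, second));
    }
}
```

With the timestamp pinned, hashing the two byte arrays (MD5 or otherwise) should give the same value within one JVM; changing only the timestamp is enough to change the bytes.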

score 1 · Answer 2 · edited Jan 29 '11 at 00:58

Just a wild shot in the dark -- are the two file systems you are calculating your hash values on differently cased?

That is, is one of them Windows, which treats the file names ABC.CLASS and abc.class as identical, and the other a Unix variant, which treats ABC.CLASS and abc.class as different?

Just a wild guess...

EDIT: You might also look at the embedded directory separator characters / \ . or : inside the zip file.

edited Jan 29 '11 at 00:58

LT.

answered Jan 28 '11 at 19:34

Joe Zitzelberger

Could also be file timestamps that change. – nos Jan 28 '11 at 19:39

score 1 · Answer 3 · answered Jan 28 '11 at 19:44

1) Check the timestamps on the files. The files created by unzipping might have a different last-modified date and/or creation date. That file metadata may end up in the new archive and therefore change its hash.

2) Are you using the same OS on both systems? If the OSes are different, they might be using different character encodings.

3) Can you diff the zip files? Different MD5 hashes should mean different data. It will be messy, but you might get some clues by comparing the raw files.
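
As an alternative to diffing or hashing the raw archives, the comparison could be made on the uncompressed entries, which ignores timestamps and other archive metadata. The sketch below is only an illustration under assumptions: the class ZipContentCompare and its methods contentDigests, sameContent, and zipOf are hypothetical names, SHA-256 is an arbitrary choice of digest, and the archives are held in memory for brevity.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipContentCompare {
    /** Maps each file entry name to the SHA-256 digest (hex) of its uncompressed bytes. */
    static Map<String, String> contentDigests(InputStream zipData) {
        Map<String, String> digests = new TreeMap<>();
        try (ZipInputStream in = new ZipInputStream(zipData)) {
            byte[] chunk = new byte[8192];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content
                }
                MessageDigest sha = MessageDigest.getInstance("SHA-256");
                int n;
                while ((n = in.read(chunk)) != -1) {
                    sha.update(chunk, 0, n);
                }
                StringBuilder hex = new StringBuilder();
                for (byte b : sha.digest()) {
                    hex.append(String.format("%02x", b));
                }
                digests.put(entry.getName(), hex.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        return digests;
    }

    /** True when both archives hold the same entry names with the same content. */
    static boolean sameContent(byte[] zipA, byte[] zipB) {
        return contentDigests(new ByteArrayInputStream(zipA))
                .equals(contentDigests(new ByteArrayInputStream(zipB)));
    }

    /** Demo helper: builds a one-entry archive with the given entry timestamp. */
    static byte[] zipOf(String name, byte[] content, long timeMillis) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            ZipEntry entry = new ZipEntry(name);
            entry.setTime(timeMillis);
            out.putNextEntry(entry);
            out.write(content);
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "abc\n".getBytes(StandardCharsets.UTF_8);
        byte[] a = zipOf("some.txt", data, 946684800000L);
        byte[] b = zipOf("some.txt", data, 946684800000L + 60_000L);
        System.out.println("raw bytes equal: " + Arrays.equals(a, b));
        System.out.println("content equal:   " + sameContent(a, b));
    }
}
```

In the upload scenario, the client could compute the name-to-digest map locally and send only that for comparison, although how the server stores or recomputes its side is left open here.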

rfeak · Answer 4 · 2011-01-29T02:14:21.043

0

You cannot compare the resulting zip files from differing zip programs and expect them to be exactly the same, even if the exact same files were used before compression.

Zipping a file is not guaranteed to be deterministic between two different implementations of the zip encodings. Zip works by replacing repeated sections of data with what amounts to a look up key. Two different algorithms can determine the dictionary (set of repeated data) differently, in an effort to optimize the compression levels. Yet, both implementations can create valid zip files that when un-zipped result in the same file.

The only reliable way to do this would be to guarantee that the exact same zip algorithm is being used in both cases.

EDIT: This is why you see different compression level settings in the Java implementation of the Deflate algorithm http://download.oracle.com/javase/1.5.0/docs/api/java/util/zip/Deflater.html
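
To make the point about settings concrete, here is a small sketch that compresses the same input with java.util.zip.Deflater at several levels and checks that every result inflates back to the original. The class name DeflateLevels, its helper methods, and the sample text are assumptions for illustration only; whether two particular levels yield different bytes depends on the input and the underlying zlib.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class DeflateLevels {
    /** Compresses the input with the given Deflater level (0 to 9). */
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses data produced by deflate above. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported data");
                }
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 500; i++) {
            text.append("the same line repeated, number ").append(i % 7).append('\n');
        }
        byte[] original = text.toString().getBytes(StandardCharsets.UTF_8);
        int[] levels = {Deflater.NO_COMPRESSION, Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION};
        for (int level : levels) {
            byte[] packed = deflate(original, level);
            boolean roundTrip = Arrays.equals(original, inflate(packed));
            System.out.println("level " + level + ": " + packed.length
                    + " bytes, round trip ok: " + roundTrip);
        }
    }
}
```

Different compressed bytes for the same input mean different archive hashes, even though the decompressed files are identical.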

edited Jan 29 '11 at 02:14

answered Jan 29 '11 at 02:07

rfeak

Indeed, the same algorithm, with the same options and tolerance settings. – Lawrence Dol Jan 29 '11 at 02:19
Exactly! I should have mentioned that as well. – rfeak Jan 29 '11 at 02:21
The OP says they are using java.util.zip for both ends of the process. Except in some very unusual cases, that would imply that the zipping algorithm will be the same, provided by Sun^h^h^h Oracle. – Joe Zitzelberger Jan 31 '11 at 14:40
@Joe Zitzelberger - That's not how I read it. Maybe the OP could clarify. He specifically calls out that he has a server side zip file created by his application, but he separates that from saying the client creates a zip. He did not specify that the client necessarily uses the exact same algorithm and settings. – rfeak Jan 31 '11 at 15:29
