
Is it possible to generate MD5 hash for .zip files in java? All the examples I found were for .txt files.

I want to know when we unzip the data, edit a file, again zip it and find the hash, will it be different from the original one?

mastov
Priya

1 Answer


You can create MD5 hashes for any arbitrary file, independently of the file type. The hash just takes any byte stream and doesn't interpret its meaning at all. So you can use the examples you have found for .txt files and apply them to .zip files.
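To illustrate that the file type doesn't matter, here is a minimal sketch that hashes the raw bytes of any file, a .zip included. The class name `FileMd5`, the method `md5Hex` and the path `example.zip` are placeholders chosen for this example; only standard JDK classes are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class FileMd5 {
    // Streams the file through the digest, so large archives are never loaded fully into memory.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        // Format the 16 digest bytes as 32 lowercase hex characters.
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // "example.zip" is a placeholder; pass a real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "example.zip");
        System.out.println(md5Hex(file));
    }
}
```

Nothing in this code knows that the input is a zip archive; the same method works for a .txt file, an image or anything else.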

And yes, editing a file inside the .zip will most likely change the MD5 of the .zip file - even though that's not guaranteed, due to hash collisions. But that's just a general property of hashes and has nothing to do with the zipping.

Note, however, that rezipping files may change the MD5 hash, even if the content has not changed. That's because even though the unzipped files are the same as before, the zipped file may vary depending on the compression algorithm used, its parameters, the time stamps stored for the entries and the order of the files.
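A small sketch of that effect, assuming nothing beyond the JDK: the same content is zipped twice in memory, and only the time stamp stored for the entry differs. The class `RezipDemo`, its helper methods and the entry name `hello.txt` are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class RezipDemo {
    // Builds a one-entry archive in memory with the given entry time stamp and compression level.
    static byte[] zip(byte[] content, long entryTimeMillis, int level) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.setLevel(level);
            ZipEntry entry = new ZipEntry("hello.txt"); // hypothetical entry name
            entry.setTime(entryTimeMillis);
            out.putNextEntry(entry);
            out.write(content);
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    // MD5 over the raw bytes of the archive itself, not over its content.
    static byte[] md5(byte[] data) throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("MD5").digest(data);
    }

    public static void main(String[] args) throws Exception {
        byte[] content = "same content".getBytes(StandardCharsets.UTF_8);
        byte[] first = zip(content, 1_400_000_000_000L, 9);
        byte[] second = zip(content, 1_500_000_000_000L, 9);
        // The archives carry different time stamps, so their raw bytes (and hashes) differ.
        System.out.println("archive hashes equal: " + Arrays.equals(md5(first), md5(second)));
    }
}
```

Both archives unzip to exactly the same file, yet hashing the archives themselves reports a difference.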

EDIT (based on your comment):

If you want to avoid those changing MD5 hashes on rezipping, you have to run the MD5 on the unzipped files. You can do that on-the-fly without actually writing the files to disk, just by using streams. ZipInputStream helps you. A simple code example:

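    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.DigestInputStream;
    import java.security.MessageDigest;
    import java.util.Arrays;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    // Inside a method that declares IOException and NoSuchAlgorithmException:
    InputStream theFile = new FileInputStream("example.zip");
    ZipInputStream stream = new ZipInputStream(theFile);
    try
    {
        ZipEntry entry;
        while((entry = stream.getNextEntry()) != null)
        {
            // Fresh digest per entry, fed only with that entry's uncompressed bytes
            MessageDigest md = MessageDigest.getInstance("MD5");
            DigestInputStream dis = new DigestInputStream(stream, md);
            byte[] buffer = new byte[1024];
            // Reading to the end of the entry updates the digest as a side effect
            int read = dis.read(buffer);
            while (read > -1) {
                read = dis.read(buffer);
            }
            // digest() returns the 16 raw (signed) bytes of the MD5;
            // format them as hex if you need the usual 32-character notation
            System.out.println(entry.getName() + ": "
                    + Arrays.toString(dis.getMessageDigest().digest()));
        }
    } finally { stream.close(); }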
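
If a single value for the whole archive is more convenient than one hash per entry, the per-entry hashes can be combined. The following is only a sketch of one possible scheme, not a standard one: it hashes each entry's content, sorts by entry name and then hashes the resulting "name, hash" lines, so neither the time stamps nor the entry order nor the compression settings influence the result. The class `ZipContentFingerprint` and its methods are names invented for this example, and the digest bytes are printed as hex here.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class ZipContentFingerprint {
    // Formats raw digest bytes as lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    static String fingerprint(InputStream zipData) throws IOException, NoSuchAlgorithmException {
        // TreeMap keeps the entries sorted by name, independent of their order in the archive.
        Map<String, String> perEntry = new TreeMap<>();
        try (ZipInputStream zin = new ZipInputStream(zipData)) {
            ZipEntry entry;
            byte[] buffer = new byte[8192];
            while ((entry = zin.getNextEntry()) != null) {
                MessageDigest md = MessageDigest.getInstance("MD5");
                int read;
                while ((read = zin.read(buffer)) != -1) {
                    md.update(buffer, 0, read); // uncompressed bytes of this entry only
                }
                perEntry.put(entry.getName(), hex(md.digest()));
            }
        }
        // Combine names and content hashes into one digest (an ad-hoc scheme for this sketch).
        MessageDigest total = MessageDigest.getInstance("MD5");
        for (Map.Entry<String, String> e : perEntry.entrySet()) {
            total.update((e.getKey() + "\n" + e.getValue() + "\n").getBytes(StandardCharsets.UTF_8));
        }
        return hex(total.digest());
    }

    public static void main(String[] args) throws Exception {
        // "example.zip" is a placeholder path.
        try (InputStream in = new FileInputStream(args.length > 0 ? args[0] : "example.zip")) {
            System.out.println(fingerprint(in));
        }
    }
}
```

Renaming a file inside the archive changes this fingerprint as well, because the entry names are part of the combined input; leave the names out of the final digest if only the contents should count.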
mastov
  • Thanks for the reply. But is there any other way to find out whether the contents inside a .zip file have changed? – Priya Jul 07 '15 at 11:47
  • @Priya: If you mind the false positive for changes that I mentioned (changed MD5 on rezipping), you have to extract the files and apply the MD5 to the extracted files. Then you will get the same hash for the same content, guaranteed. But you won't ever get rid of the (extremely unlikely) false negatives for changes (same hash code for different files). They are just a property of hashes you have to accept. If you cannot live with that, don't use hashes. – mastov Jul 07 '15 at 11:51
  • @Priya: Btw., by "extracting" I don't mean you have to write the files physically to disk. You can do that on-the-fly by using Java's zip streams. Here's an example of how to use them; instead of writing the files to disk, you can pass them on directly to an MD5 algorithm: http://www.thecoderscorner.com/team-blog/java-and-jvm/12-reading-a-zip-file-from-java-using-zipinputstream – mastov Jul 07 '15 at 11:55
  • @Priya: I've just updated my answer to reflect what we've been discussing here in the comments. – mastov Jul 07 '15 at 12:24
  • Thanks for the input.. I'm trying your solution. – Priya Jul 07 '15 at 12:59
  • I was able to hash zip files with MD5, but the hash differs when we zip the files again, because of the time stamp. – Priya Jul 10 '15 at 12:20
  • @Priya: Time stamp? The code in my answer hashes the *content* of the files within the .zip - no time stamp involved. – mastov Jul 11 '15 at 20:55
  • It's not about the code you shared. You initially explained that the hash may differ for the same zipped file, right? It's because of the time at which it is zipped. – Priya Jul 13 '15 at 05:09
  • @Priya: I see. Yes, that's unavoidable. Not only because of a time stamp, but also because of different compression algorithms, different options or a different file order. – mastov Jul 13 '15 at 09:00
  • @mastov This does not give the MD5 hash of the files inside the zip; instead it gives `[-105, 9, -28, -58, -32, 71, 119, -115, -91, -127, -105, 101, 91, 79, 5, 10]`. What are those? And is there any way we can get the `md5hash` of the files without extracting the zip? Thanks – Kasun Siyambalapitiya Aug 30 '17 at 09:43