
I am currently reading about the deflate algorithm, and as a learning exercise I compressed one file using several different methods. What confuses me is that each method produced different bytes for the compressed file.

I zipped the file with WinRAR, with 7-Zip, and with Java's zlib library (the ZipOutputStream class), and I also deflated the source data directly (the Deflater class). All four methods produced completely different bytes.

I expected all of the methods to produce the same byte array, but they did not. Why could that be? I checked the file headers to make sure that every tool actually used the deflate algorithm.

Can anyone help with this? Is it possible for the deflate algorithm to produce different compressed results for exactly the same source file?

Michael Munta

2 Answers


The reason is that Deflate is a format, not an algorithm. Compression happens in two steps. First comes LZ77, where the compressor can choose among a practically unlimited number of algorithms for finding matches. Then the LZ77 messages are encoded with Huffman trees, and again there are a great many ways to define those trees. Additionally, from time to time in the stream of LZ77 messages it can pay off to redefine the trees and start a new block, or not, which leaves an enormous number of choices about how to split the blocks.

Zerte
  • That makes sense. So the decompression method is also capable of reconstructing original files even though they are all represented differently? – Michael Munta Oct 21 '20 at 12:30

There are many, many deflate representations of the same data. Surely you have already noticed that you can set a compression level. That could only have an effect if there were different ways to compress the same data. What you get depends on the compression level, any other compression settings, the software you are using, and the version of that software.

The only guarantee is that when you compress and then decompress, you get exactly what you started with. There is no guarantee that you get the same thing when you decompress and then compress, nor does there need to be one, nor should there be.
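To make the point concrete, here is a minimal Java sketch, assuming a current JDK and its bundled zlib. The class name `DeflateVariants`, its helper methods, and the sample text are made up for illustration. It compresses the same bytes at two different levels as raw deflate, then checks that the compressed bytes differ while both streams inflate back to the original.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class, not part of any library.
final class DeflateVariants {

    /** Builds some repetitive sample input so that compression has something to find. */
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 50; i++) {
            sb.append("Deflate is a format, not an algorithm. ");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Compresses input as a raw deflate stream (no zlib header or trailer). */
    static byte[] deflateRaw(byte[] input, int level) {
        Deflater deflater = new Deflater(level, true); // true = "nowrap"
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses a raw deflate stream, whichever compressor settings produced it. */
    static byte[] inflateRaw(byte[] compressed) {
        Inflater inflater = new Inflater(true); // true = expect raw deflate
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("stream is truncated or needs a dictionary");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not a valid raw deflate stream", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        byte[] stored = deflateRaw(data, Deflater.NO_COMPRESSION);   // level 0: stored blocks
        byte[] best = deflateRaw(data, Deflater.BEST_COMPRESSION);   // level 9

        System.out.println("original size:       " + data.length);
        System.out.println("level 0 size:        " + stored.length);
        System.out.println("level 9 size:        " + best.length);
        System.out.println("same compressed bytes? " + Arrays.equals(stored, best));
        System.out.println("level 0 round-trips?   " + Arrays.equals(inflateRaw(stored), data));
        System.out.println("level 9 round-trips?   " + Arrays.equals(inflateRaw(best), data));
    }
}
```

Levels 0 and 9 are used because they are guaranteed to differ on compressible input: level 0 only emits stored (uncompressed) blocks. Intermediate levels, other tools, or other library versions may or may not produce identical bytes for a given input.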

Why do you have that goal?

Mark Adler
  • The goal was purely educational. I just thought the result was supposed to be the same, now I know that it doesn't have to be so. – Michael Munta Oct 21 '20 at 13:41
  • One other thing I tried was using Java's zlib inflate method on data I compressed using WinRAR or 7-Zip, and it didn't work. It seems to work only on data that was previously compressed with zlib's deflate method. So I cannot be sure that a given decompression method will work on all the various representations of compressed data? – Michael Munta Oct 22 '20 at 10:19
  • Did you extract just the deflate data from the zip files, and use raw inflate in Java? – Mark Adler Oct 22 '20 at 13:30
  • Yes, I started at the offset the deflate data starts and ended it before the trailer. – Michael Munta Oct 22 '20 at 14:04
  • Also what do you mean by raw deflate? I used the one provided by zlib which is java.util.zip.Inflater. I get the error 'incorrect header check'. – Michael Munta Oct 22 '20 at 14:18
  • Ok, so you're not understanding the wrappers. Deflate data can be wrapped in a zlib header and trailer, or in a gzip header and trailer, or several deflate streams representing different files can be wrapped in a zip file structure: a set of headers, trailers, and a central directory. "Incorrect header check" means you're trying to use a decoder for one of those formats on one of the other formats. Won't work. See this answer: https://stackoverflow.com/a/20765054/1180620 – Mark Adler Oct 22 '20 at 14:48
  • Maybe I understood your answer wrong, but I did not pass the whole zip file to the inflate call. I ignored the zip file headers and trailers and I only provided the raw deflated data to the function. I suppose that since this deflated data was not deflated via zlib in the first place then zlib also can't inflate it. Does that make sense? – Michael Munta Oct 23 '20 at 07:58
  • Raw deflate means deflate compressed data with no headers or trailers. You can only get "incorrect header check" if you are trying to inflate zlib or gzip data. You need to request raw inflation with `inflateInit2()` to not look for headers. – Mark Adler Oct 23 '20 at 16:40
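
`inflateInit2()` belongs to the C zlib API. In Java, the closest equivalent appears to be the `Inflater(boolean nowrap)` constructor: `new Inflater(true)` expects raw deflate data, while the default `new Inflater()` expects a zlib header and trailer. The sketch below assumes a current JDK. The class name `RawVersusZlib` and its helpers are made up for illustration, and `looksLikeZlibHeader` is only a rough restatement of the two-byte zlib header rule (compression method 8, and the 16-bit header value divisible by 31).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class, not part of any library.
final class RawVersusZlib {

    /** Deflates input; nowrap = true omits the zlib header and Adler-32 trailer. */
    static byte[] deflate(byte[] input, boolean nowrap) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Inflates data; nowrap must match how the data is actually wrapped. */
    static byte[] inflate(byte[] compressed, boolean nowrap) {
        Inflater inflater = new Inflater(nowrap);
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("stream is truncated or needs a dictionary");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            // Wrapped in an unchecked exception to keep the demo short.
            throw new IllegalArgumentException(e.getMessage(), e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Rough check of the 2-byte zlib header: method 8 and (CMF * 256 + FLG) % 31 == 0. */
    static boolean looksLikeZlibHeader(byte[] d) {
        if (d.length < 2) {
            return false;
        }
        int cmf = d[0] & 0xFF;
        int flg = d[1] & 0xFF;
        return (cmf & 0x0F) == 8 && ((cmf << 8) | flg) % 31 == 0;
    }

    public static void main(String[] args) {
        byte[] data = "raw deflate has no header and no trailer".getBytes(StandardCharsets.UTF_8);
        byte[] raw = deflate(data, true);
        byte[] wrapped = deflate(data, false);

        System.out.println("raw inflated with nowrap:  " + Arrays.equals(inflate(raw, true), data));
        System.out.println("zlib inflated by default:  " + Arrays.equals(inflate(wrapped, false), data));
        System.out.println("zlib header detected:      " + looksLikeZlibHeader(wrapped));

        try {
            // Mismatch: raw deflate data handed to a decoder that expects a zlib header.
            inflate(raw, false);
            System.out.println("unexpectedly succeeded");
        } catch (IllegalArgumentException e) {
            System.out.println("mismatched decoder failed: " + e.getMessage());
        }
    }
}
```

The same applies to deflate data cut out of a zip entry: it is raw deflate, so it would go through `inflate(bytes, true)` in this sketch, regardless of which tool wrote the archive.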