I spent a few days reading the zlib (and gzip and deflate) RFCs and I can say they are kind of rubbish. Quite a few details are missing, so I'm opening this question.

I'm trying to parse zlib data and I need to know some details about the header.

First of all, the RFC says there will be 2 bytes, CMF and FLG.

CMF is divided into two 4-bit fields. The first one (the low 4 bits) is CM and the second one (the high 4 bits) is CINFO.

What are the possible values of CM? The RFC says that 8 means deflate and that 15 is reserved, but what about the rest of the possible values?

CINFO, on the other hand, should always be 8, if I understand the RFC correctly (please correct me if I'm wrong).

Skipping FLG and the possible FDICT, we get to the Compressed data section. This part of the RFC says:

For compression method 8, the compressed data is stored in the
deflate compressed data format as described in the document
"DEFLATE Compressed Data Format Specification" by L. Peter
Deutsch. (See reference [3] in Chapter 3, below)

What does this mean? Should I assume that CM will always be 8? If yes, then why does the entire CM thing exist?

Last, I'm a bit confused. I always believed zlib could wrap both deflate and gzip, but reading this RFC I can't see where gzip compressed data fits in. Is there anything I'm missing here?

alexandernst

1 Answer

What are the possible values of CM? The RFC says that 8 means deflate and that 15 is reserved, but what about the rest of the possible values?

...

Should I assume that CM will always be 8? If yes, then why does the entire CM thing exist?

CM is there for future use and to allow other (non-standard) compression methods:

Other compressed data formats are not specified in this version of the zlib specification. (RFC 1950, "ZLIB Compressed Data Format Specification version 3.3")

You should NOT assume that it's always 8. Instead, you should check it and, if it's not 8, throw a "not supported" error.
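
As a minimal sketch of that check, assuming the header bytes have already been read into memory: the helper names below (`zlib_cm`, `zlib_cinfo`, `zlib_check_method`) are hypothetical and not part of any library; only the bit layout (CM in bits 0 to 3 of CMF, CINFO in bits 4 to 7) comes from RFC 1950.

```c
#include <stdint.h>

/* Hypothetical names; only the bit layout is taken from RFC 1950. */
enum { ZLIB_CM_DEFLATE = 8 };

/* CM occupies the low 4 bits of CMF. */
static unsigned zlib_cm(uint8_t cmf)    { return cmf & 0x0Fu; }

/* CINFO occupies the high 4 bits of CMF. */
static unsigned zlib_cinfo(uint8_t cmf) { return (unsigned)cmf >> 4; }

/* Returns 0 if the compression method is supported, -1 otherwise. */
static int zlib_check_method(uint8_t cmf)
{
    return zlib_cm(cmf) == ZLIB_CM_DEFLATE ? 0 : -1;
}
```

A caller would report "not supported" whenever `zlib_check_method` returns -1, rather than trying to interpret the rest of the stream.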


CINFO, on the other hand, should always be 8, if I understand the RFC correctly (please correct me if I'm wrong).

No, the meaning of CINFO depends on CM. If CM is 8 (the only meaningful standardized value), then:

CINFO is the base-2 logarithm of the LZ77 window size, minus eight (CINFO=7 indicates a 32K window size). Values of CINFO above 7 are not allowed in this version of the specification. (RFC 1950, "ZLIB Compressed Data Format Specification version 3.3")

So in fact, when CM is 8, CINFO can't be 8: the valid values are 0 to 7, corresponding to window sizes from 256 bytes to 32K.
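
The formula quoted from the RFC can be written out as follows. This is only a sketch for CM = 8; the function name `zlib_window_size` and the choice of returning 0 for an invalid CINFO are assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Returns the LZ77 window size in bytes implied by CMF (assuming CM = 8),
 * or 0 if CINFO is outside the range allowed by RFC 1950.
 * zlib_window_size is a hypothetical helper name.
 */
static uint32_t zlib_window_size(uint8_t cmf)
{
    unsigned cinfo = (unsigned)cmf >> 4;   /* high 4 bits of CMF */

    if (cinfo > 7)
        return 0;                          /* values above 7 are not allowed */

    /* CINFO = log2(window size) - 8, so window size = 2^(CINFO + 8). */
    return (uint32_t)1 << (cinfo + 8);
}
```

For the common CMF byte 0x78, CINFO is 7 and the function returns 32768 (a 32K window).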


Skipping FLG and the possible FDICT, we get to the Compressed data section. This part of the RFC says:

For compression method 8, the compressed data is stored in the
deflate compressed data format as described in the document
"DEFLATE Compressed Data Format Specification" by L. Peter
Deutsch. (See reference [3] in Chapter 3, below)

What does this mean?

It means that the details of the DEFLATE encoding are not specified in this standard, but are described elsewhere, at ftp://ftp.uu.net/pub/archiving/zip/zlib/.

If you prefer, DEFLATE also has its own RFC: RFC 1951, "DEFLATE Compressed Data Format Specification version 1.3".


Last, I'm a bit confused. I always believed zlib could wrap both deflate and gzip, but reading this RFC I can't see where gzip compressed data fits in. Is there anything I'm missing here?

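No, the zlib format can't wrap gzip. gzip and zlib are different wrappers for deflate data (as is the zip format, the PNG format, the PDF format, etc.). The zlib library, as opposed to the zlib format, can read and write raw deflate, zlib and gzip streams, which may be where the confusion comes from.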
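
Because the two wrappers start differently, the first two bytes are usually enough to tell them apart. The sketch below is a heuristic, not a definitive detector: `guess_wrapper` and the enum names are hypothetical, the gzip magic bytes (0x1f, 0x8b) come from RFC 1952, and the zlib test uses the header rules of RFC 1950. Raw deflate data has no header at all, so it cannot be recognized this way.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names used for illustration only. */
enum wrapper_kind { WRAPPER_UNKNOWN, WRAPPER_GZIP, WRAPPER_ZLIB };

static enum wrapper_kind guess_wrapper(const uint8_t *buf, size_t len)
{
    if (len < 2)
        return WRAPPER_UNKNOWN;

    /* gzip (RFC 1952): ID1 = 0x1f, ID2 = 0x8b. */
    if (buf[0] == 0x1f && buf[1] == 0x8b)
        return WRAPPER_GZIP;

    /* zlib (RFC 1950): CM = 8, CINFO <= 7, and CMF*256 + FLG
     * must be a multiple of 31. */
    if ((buf[0] & 0x0f) == 8 && (buf[0] >> 4) <= 7 &&
        ((((unsigned)buf[0]) << 8) | buf[1]) % 31 == 0)
        return WRAPPER_ZLIB;

    return WRAPPER_UNKNOWN;
}
```

The gzip magic can never be mistaken for a zlib header here, since 0x1f has 15 in its low 4 bits, which is not a valid CM for this check.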

Gzip uses DEFLATE:

The format presently uses the DEFLATE method of compression but can be easily extended to use other compression methods. (RFC 1952, "GZIP file format specification version 4.3")

CM = 8 denotes the "deflate" compression method with a window size up to 32K. This is the method used by gzip and PNG (RFC 1950, "ZLIB Compressed Data Format Specification version 3.3")


If you find the RFC unclear or difficult to understand, consider reading the source code of a zlib implementation. Some implementations may be non-standard, but the source can still help resolve some of your doubts.

Here's an excerpt from the source code of zlib from zlib.net that answers one of your questions:

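#define Z_DEFLATED   8
/* ... */
/* BITS(4) yields the low 4 bits of the bit accumulator; at this point
   those are the CM field of the CMF byte. */
if (BITS(4) != Z_DEFLATED) {
    strm->msg = (char *)"unknown compression method";
    state->mode = BAD;
    break;
}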
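
The FLG byte that the question skips can be decoded in the same spirit. This is a sketch based on the field layout in RFC 1950 (FCHECK in bits 0 to 4, FDICT in bit 5, FLEVEL in bits 6 to 7); the struct and function names are hypothetical. Note that the FCHECK rule is that CMF*256 + FLG as a whole is a multiple of 31, not that the remainder equals FCHECK.

```c
#include <stdint.h>

/* Hypothetical container for the decoded FLG fields. */
struct zlib_flags {
    unsigned fcheck;  /* bits 0-4: makes CMF*256 + FLG a multiple of 31 */
    unsigned fdict;   /* bit 5: a preset dictionary ID follows the header */
    unsigned flevel;  /* bits 6-7: compression level hint */
};

/* Returns 0 and fills *out if the header check passes, -1 otherwise. */
static int zlib_parse_flg(uint8_t cmf, uint8_t flg, struct zlib_flags *out)
{
    /* CMF and FLG are read as a 16-bit value in MSB order. */
    if (((((unsigned)cmf) << 8) | flg) % 31 != 0)
        return -1;

    out->fcheck = flg & 0x1f;
    out->fdict  = (flg >> 5) & 1;
    out->flevel = flg >> 6;
    return 0;
}
```

If `fdict` is set, the 4 bytes after FLG hold the dictionary identifier and the compressed data starts after them; otherwise the compressed data starts right after FLG.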
Andrea Corbellini
  • Thank you for taking your time to reply. I have a few more questions. What exactly is the `LZ77 window size`? Do I need it just to parse (not inflate) the zlib data? It doesn't look like I need it. – alexandernst Jan 21 '16 at 17:34
  • @alexandernst: LZ77 works by finding repeated patterns and replacing each repeat with a back-reference to the earlier occurrence (so that `abcdefabc` becomes `abcdef` followed by "copy 3 bytes from 6 bytes back"). The window size is the number of bytes of previous data that can be searched for such repeats. The larger the window, the better the compression can be, and the higher the memory/CPU requirements. – Andrea Corbellini Jan 21 '16 at 18:08
  • One last question. I'm looking at an example zlib file right now. The first 2 bytes are `8B CA`. FCHECK is 11 and the unsigned value of those 2 bytes is 8075. But 8075%31 is not 11. Am I checking FCHECK incorrectly? – alexandernst Jan 21 '16 at 18:38
  • Sorry, was misreading the file. I'm able to calc the FCHECK correctly. – alexandernst Jan 21 '16 at 19:13
  • If you are parsing the deflate data (I guess just to find where it ends?), then you don't need the sliding window. If you are decompressing then you need either a 32K sliding window, in which case you don't care what the zlib header says, or if you are tight on memory, you can allocate a smaller sliding window if the zlib header says that you are allowed to, because the data was compressed with a smaller window. – Mark Adler Jan 22 '16 at 02:55
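
To make the window discussion in the comments more concrete, here is a rough sketch of how a decompressor resolves one LZ77 back-reference. It is not zlib's implementation: `lz77_copy` and its parameters are hypothetical, and a real inflater would also bound `distance` by the window size (at most 32K for deflate) and typically use a circular buffer rather than one flat output buffer.

```c
#include <stddef.h>

/*
 * Appends `length` bytes to out[0..*pos), each copied from `distance`
 * bytes behind the current write position. Assumes *pos <= cap.
 * Returns 0 on success, -1 if the reference points before the start
 * of the output or the result would not fit in `cap` bytes.
 */
static int lz77_copy(unsigned char *out, size_t cap, size_t *pos,
                     size_t distance, size_t length)
{
    if (distance == 0 || distance > *pos || length > cap - *pos)
        return -1;

    /* Copy byte by byte on purpose: source and destination may overlap
     * when length > distance, which is how short runs get repeated. */
    for (size_t i = 0; i < length; i++) {
        out[*pos] = out[*pos - distance];
        (*pos)++;
    }
    return 0;
}
```

Starting from the 6 bytes `abcdef`, a reference with distance 6 and length 3 appends `abc`. The decompressor only needs to keep as many previous bytes as the largest distance it may see, which is why a smaller declared window allows a smaller buffer.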