3

I am curious about the zlib data format and trying to understand the zlib header as described in RFC1950 (https://www.rfc-editor.org/rfc/rfc1950). I am however new to this kind of low level interpretation and seem to have run afoul with some of my conclusions.

I have the following compressed data (from a PDF stream object):

b'h\xdebbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01\x02\x0c\x00!\xa4\x03\xc4'

In python, I have successfully decompressed and re-compressed the data:

b'x\xdacbd\x10`b`Rcb`\xb0ab`\xdc\x0b\xa4\x93\x98\x18\xfe>\x06\xb2\xed\x01!\xa4\x03\xc4'

As I have understood the discussion/answer in Deflate and inflate for PDF, using zlib C++ The difference in result of the compressed data should not matter as it is an effect of different applied methods to compress the data.

Assuming the last four bytes !\xa4\x03\xc4 are the ADLER32 (Adler-32 checksum) my questions pertain to the first 2 bytes.

  0   1     0   1   2   3                             0   1   2   3
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+
|CMF|FLG| |    [DICTID]   | |...compressed data...| |    ADLER32    |
+---+---+ +---+---+---+---+ +=====================+ +---+---+---+---+

CMF

The first byte represents the CMF, which in my two instances would be

  • chr h = dec 104 = hex 68 = 01101000
  • and chr x = dec 120 = hex 78 = 01111000

This byte is divided into a 4-bit compression method and a 4-bit information field depending on the compression method.

  • bits 0 to 3 CM Compression method

  • bits 4 to 7 CINFO Compression info

+----|----+      +----|----+     +----|----+
|0000|0000| i.e. |0110|1000| and |0111|1000|
+----|----+      +----|----+     +----|----+
  CM |CINFO        CM |CINFO       CM |CINFO

Where

[CM] identifies the compression method used in the file. CM = 8 denotes the "deflate" compression method with a window size up to >32K. This is the method used by gzip and PNG (see CM = 15 is reserved.

and

For CM = 8, CINFO is the base-2 logarithm of the LZ77 window size, minus eight (CINFO=7 indicates a 32K window size). Values of CINFO above 7 are not allowed in this version of the specification. CINFO is not defined in this specification for CM not equal to 8.

As I understand it,

  • the only valid CM is 8
  • CINFO can be 0-7

Cf https://stackoverflow.com/a/34926305/7742349

You should NOT assume that it's always 8. Instead, you should check it and, if it's not 8, throw a "not supported" error.

Cf https://groups.google.com/forum/#!msg/comp.compression/_y2Wwn_Vq_E/EymIVcQ52cEJ

An exhaustive list of all 64 current possibilities for zlib headers:

COMMON
78 01
78 5e
78 9c
78 da
RARE
08 1d   18 19   28 15   38 11   48 0d   58 09   68 05
08 5b   18 57   28 53   38 4f   48 4b   58 47   68 43
08 99   18 95   28 91   38 8d   48 89   58 85   68 81
08 d7   18 d3   28 cf   38 cb   48 c7   58 c3   68 de
VERY RARE
08 3c   18 38   28 34   38 30   48 2c   58 28   68 24   78 3f
08 7a   18 76   28 72   38 6e   48 6a   58 66   68 62   78 7d
08 b8   18 b4   28 b0   38 ac   48 a8   58 a4   68 bf   78 bb
08 f6   18 f2   28 ee   38 ea   48 e6   58 e2   68 fd   78 f9

Q1 My first question is simply

  • Why is the CINFO before the CM?, i.e.,
  • why is it not 87, 80, 81, 82, 83, ...

As far as I know, byte order is not an issue here. I suspect it may be related to the least significant bit (RFC1950 § 2.1. Overall conventions), but I cannot quite understand how it would result in, e.g., 78 instead of 87...

Q2 My second question

  • If CINFO 7 represents "a window size up to 32K", then what does 1-6 correspond to? (assuming 0 means window size 0, as in, no compression applied).

FLG

The second byte represents the FLG

\xde -> 11011110
\xda -> 11011010

[FLG] [...] is divided as follows:

  • bits 0 to 4 FCHECK (check bits for CMF and FLG)

  • bit 5 FDICT (preset dictionary)

  • bits 6 to 7 FLEVEL (compression level)

+-----|-|--+      +-----|-|--+     +-----|-|--+
|00000|0|00| i.e. |11011|1|10| and |11011|0|10|
+-----|-|--+      +-----|-|--+     +-----|-|--+
   C  |D| L          C  |D| L         C  |D| L

Bit 0-4 as far as I can tell is some form of "checksum" or integrity control?

Bit 5 indicate whether a dictionary is present.

FDICT (Preset dictionary) If FDICT is set, a DICT dictionary identifier is present immediately after the FLG byte. The dictionary is a sequence of bytes which are initially fed to the compressor without producing any compressed output. DICT is the Adler-32 checksum of this sequence of bytes (see the definition of ADLER32 below). The decompressor can use this identifier to determine which dictionary has been used by the compressor.

Q3 My third question

Assuming that "1" indicates "is set"

\xde -> 11011_1_10
\xda -> 11011_0_10

According to the specification DICTID consist of 4 bytes. The four following bytes in the compressed streams I have are

bbd\x10
cbd\x10

Why are the compressed data from the PDF stream object (with the FDICT 1) and the compressed data with python zlib (with the FDICT 0) almost identical?

Granted that I do not understand the function of the DICTID, but is it not supposed to exist only if FDICT is set?

Q4 My fourth question

Bit 6-7 sets the FLEVEL (Compression level)

These flags are available for use by specific compression methods. The "deflate" method (CM = 8) sets these flags as follows:

0 - compressor used fastest algorithm

1 - compressor used fast algorithm

2 - compressor used default algorithm

3 - compressor used maximum compression, slowest algorithm

The information in FLEVEL is not needed for decompression; it is there to indicate if recompression might be worthwhile.

I would have thought that the flags would be:

0 (00)
1 (01)
2 (10)
3 (11)

However from the What does a zlib header look like?

01 (00000001) - No Compression/low
[5e (01011100) - Default Compression?]
9c (10011100) - Default Compression
da (11011010) - Best Compression

I note however that the two left-most bits seem to correspond to what I have expected; I feel am obviously failing to comprehend something fundamental in how to interpret bits...

Community
  • 1
  • 1
user212827
  • 33
  • 1
  • 5
  • Let's start with your CM/CINFO - the bits 0-3 are on the right side (less significant bits) so in all your samples CM is bin 1000 or 8. Maybe you take this information into account and start thinking it over again? – Stefan Hegny Mar 21 '17 at 18:53

1 Answers1

5

The RFC says:

CMF (Compression Method and flags)
         This byte is divided into a 4-bit compression method and a 4-
         bit information field depending on the compression method.

            bits 0 to 3  CM     Compression method
            bits 4 to 7  CINFO  Compression info

The least significant bit of a byte is bit 0. The most significant bit is bit 7. So the diagram you made for mapping CM and CINFO to bits is backwards. 0x78 and 0x68 both have a CM of 8. Their CINFO's are 7 and 6 respectively.

CINFO is what the RFC says it is:

CINFO (Compression info)
   For CM = 8, CINFO is the base-2 logarithm of the LZ77 window
   size, minus eight (CINFO=7 indicates a 32K window size).

So, a CINFO of 7 means a 32 KiB window. 6 means a 16 KiB. CINFO == 0 does not mean no compression. It means a window size of 256 bytes.

For the flag byte, you got it backwards again. FDICT is not set. For both of your examples, the compression level is 11, maximum compression.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
  • I am reading up on most and least bit significance. I think my problem was that I interpreted "the bits are numbered 76543210" as reference to the value of the bits, that is, `87` or `8>7` (which is the same order as CM and then CINFO). If I understand it correctly, now, it simply refers to the bit positions, that is, 0-3 refers to bits occupying positions `3<--2<--1<--0` but read/written `3-->0`? – user212827 Mar 21 '17 at 22:29
  • So... `CINFO 5 = 8192, 4 = 4096, 3 = 2048, 2 = 1024, 1 = 512` (2^8+[0, 1, 2, 3, 4 ,5, 6, 7]). – user212827 Mar 21 '17 at 22:30
  • I don't understand that last comment. Anyway, yes, just like in decimal, the most significant digit is written first and the least significant last. They are numbered from least to most. – Mark Adler Mar 22 '17 at 00:01
  • 1
    Apologies for delayed reply. The last comment referred to the value of CINFO (“what does 1-6 correspond to”). Math, like programming, is not my field :) but if `log_2(32768) = 15; 15 - 8 = 7` and `log_2(256) = 8; 8 - 8 = 0` the reverse should be `2^(8+7) = 32768` and `2^(8 + 0) = 256` respectively. Therefore 2^(8+x) where x in [0, 1, 2, 3, 4 ,5, 6, 7] should answer the question “what does 1-6 correspond to”. – user212827 Mar 24 '17 at 11:38