Deflate compression spec clarification(s)

Question

my hope for this question (see: bottom) is to lay out as much as I know about the deflate process, and I can receive corrections on areas where I am (perhaps very) misinformed. Hopefully, at the end of it, this question could be a handy resource.

Zlib header

The first two bytes equate to a header for the zlib compression used with the format of (credit)

---CMF---  ---FLG---
0111.1000  1101.0101
CINF -CM-  +-||
           | |+- FCHECK
           | +-- FDICT
           +---- FLEVEL

From RFC 1950, right to left:

FCHECK (1.0101) - validates that CMF & FLG as a 16 bit unsigned int is a multiple of 31
FDICT (0) - if set, indicates preset DICT following immediately after FLG
FLEVEL (11) - compression "intensity" [0-3]
CM (1000) - for compression method, where CM = 8 == "deflate" compression method
CINF (0111) - indicates the size of the sliding window used, where CINF = 7 == 32K sliding window

Data block header

The next three bits in the NEW BYTE equate to a header for the Huffman encoded block:

---CMF---  ---FLG---  NEW BYTE
0111.1000  1101.0101  11101100
                           |-|
                           | +- BFINAL
                           +--- BTYPE

From RFC 1951 right to left:

BFINAL (0) - is set (1) if this is the last block of data
BTYPE (10) - Huffman encoding : (00)none; (01)Fixed Huffman codes; (10) dynamic codes; (11) invalid

The Huffman Codes

From here I will work off the assumption of BTYPE = (10)

The following values immediately proceed:

NEW BYTE                NXT BYTE                  
(11101)100       ->     101)(11101)   ->   0111111(1
       |-|
       | +- BFINAL
       +--- BTYPE

HLIT (11101) - 5-bit number of length/literal codes, 257 added (257-286)
HDIST (11101) - 5-bit number of distance codes, 1 added (1-32)
HCLEN (1111) - 4-bit number of code-length codes, 4 added (4-19)

Immediately following this is HCLEN(don't forget +4) 3-bit fields, where the values are assigned to this sequence, in order:

16 17 18 0 8 7 9 6 10 5 11 4 12 3 13 2 14 1 15

Since HCLEN = 19, the whole sequence is used

A code length of 0 in that sequence means the corresponding symbol is not used.

As a graphical example, after reading the 19x3 bits we have six extra bits(extra bits in brackets):

NXT BYTE 00000000 00000000 00000000 00000000 00000000 00000000 [000000](00

My Question

Do the last bits in brackets above, get thrown away?

I wouldn't downvote, but I have my doubts about the value of this question. 1) it's not about PNG, it's about ZLIB/deflate spec (that PNG uses ZLIB/deflate is irrelevant here - the concept of "decoupling layers" is essential to engineering). 2) It's not a real concrete question, it's rather an appeal to help to understand the details of ZLIB/deflate format using an special example (this will hardly help any other people) 3) it's slightly strange (though perhaps also commendable) to pretend to understand a ZLIB stream at the bit level, specially if you don't know about Huffman compression — leonbloy, Nov 05 '15 at 16:16
... and it smells funny in the context of the question title: you normally don't need to understand things at such a low level to understand the PNG format. 4) see here http://stackoverflow.com/questions/26018811/trying-to-understand-zlib-deflate-in-png-files — leonbloy, Nov 05 '15 at 16:17
I suppose my definition of understanding something in this domain is being able to code it. As in program a decoder. But I would argue the value of the question, it's a great exercise to truly master file I/O with bytes, even down to individual bits. It also involves endian-ness. All areas I am weak in, as a college educated CS student. Maybe the question title was incorrect, can that be changed? Should it be, if it can be? And my (possibly poor) logic for including PNG is that a large amount of searches for deflate would naturally involve PNG. I believe this is awesome stuff to learn from. — Trés DuBiel, Nov 05 '15 at 20:48
@leonbloy Also, one of stack overflow's best features, in my opinion, are the truly in-depth answers that dispel any confusion and lay out an optimal way to do X. I was trying to emulate or provoke such an answer from the community — Trés DuBiel, Nov 05 '15 at 20:52
It's a great exercise , I couldn't agree more! To ask for help for such a good exercise is (perhaps, I'm not sure) not the same as to ask a good question. "I am going to post this as is, go read up on Huffman codes, then update accordingly." That is nice/useful stuff for, say, a tecnical blog post. That's not exactly the idea of this site : concrete answers for concrete questions. — leonbloy, Nov 05 '15 at 21:02
@leonbloy I see now, what you mean. I suppose being on the end that benefits from already established questions, with answers and summaries edited in at a later date after resolution, had somewhat skewed my idea of a good question. Should I attempt to edit it and establish a concrete question to be answered? Thanks for the input — Trés DuBiel, Nov 05 '15 at 21:33
In PNG the IDAT chunk contains the DEFLATE compressed pixels, those first 3 bits for the Huffman block, are they zero padded to finish off the byte, and what endian do they use? Because this PNG file I made it's first byte after the IDAT signature is 0x78, or 0111 1000. do I start reading from the left, so the 3 bits would be 011, or do I skip the highest bit because unsigned and start at 111, or do I read from the right for 000? ALSO: is the huffman table encoded, and how can I find how long it is/where is starts? — MarcusJ, Nov 29 '15 at 05:39
@MarcusJ I initially did all of this because I was studying PNG files. The IDAT chunk is structured exactly how I have it laid out, above. Starting with a zlib header, then the data block header, followed by the huffman code(s), then the encoded data. Your first byte of `0x78` is identical to my example above (CMF byte containing CINF and CM) — Trés DuBiel, Nov 29 '15 at 23:48

score 6 · Accepted Answer · answered Nov 06 '15 at 07:05

6

No. The only times in a deflate stream that you skip bits to go to a byte boundary are for a stored block (00), or when the end code in the last block is read. After the bits for the code length code lengths, you continue with the subsequent bits to use the generated Huffman code to read the code lengths.

answered Nov 06 '15 at 07:05

Mark Adler

101,978
13
118
158

I appreciate the reply Dr. Adler. I discovered your puff.c program, and I borrowed one portion from the construct function. Namely, the part where it tests for over-subscribed/incomplete set of lengths, and it returns negative on the five images I've tested it on, when building the first tree (code-length codes). Could you perhaps reason as to why? Images I've tested have come from gimp(created) or the internet(downloaded). They all had BTYPE of 2, and I read in the values as described above. I also cross checked with puff.c and I'm reading the correct bits. Thanks again. – Trés DuBiel Nov 06 '15 at 23:45
You're probably not reading the correct bits, or you are not interpreting them correctly. – Mark Adler Nov 07 '15 at 03:07
Thank you, I'll review my implementation – Trés DuBiel Nov 07 '15 at 05:40

Deflate compression spec clarification(s)

Zlib header

Data block header

The Huffman Codes

My Question

1 Answers1