How do I decompress a MSZIP block?

Question

I have a compressed file that is a CAB that I wish to extract a file from in Linux. Since there aren't any native CAB extractors on Linux, I figured I'd try my hand at getting one done.

While I've seen the MSZIP documentation[0] as well as [1] and [2], I'm having difficulties in decompressing it even given that each Block is compressed using a modified DEFLATE compressor. While I did use [3] to decompress the blocks, aside for the first block, the rest of the blocks had a lot of missing data [data which were completely nulled [i.e. 0x00 filled].

So I'm lead to believe I'll need to figure it out manually.

Unfortunately, I am having trouble understanding [2]. So, with the following data:

ED 9D 79 70 1C C7 75 87 07 E0 4D 88 24 EE 8B 94
04 1E A2 44 51 04 81 05 08 2D 11 4A C2 5E 00 96
...

And the following code to dissect the data:

#!/bin/env python

import os


def to_bin(in_item):
    b = ord(chr(in_item))
    retval = bin(b).replace("0b", "")
    if len(retval) < 8:
        lx = 8 - len(retval)
        q = "0" * lx
        retval = q + retval
    return retval


def swapbytes(in_bit_str):
    p = in_bit_str[:4]
    q = in_bit_str[4:]
    return q + p


pth = os.path.join("comp", "test_0.dat")

fp = open(pth, 'rb')
fpd = fp.read(10)
fp.close()

dx = ""
for item in fpd:
    cx = to_bin(item)
    scx = swapbytes(cx)
    dx += scx
    print(cx)
    print("---> %s" % scx)

print("Final: %s" % dx)

bfinal = dx[0]
btype = dx[1:3]

print("Bfinal: %s" % bfinal)
print("Btype: %s" % btype)

rest = dx[3:]

hlit = rest[:5]
hdist = rest[5:11]
hclen = rest[11:15]

hlitd = int(hlit, 2)
hdistd = int(hdist, 2)
hclend = int(hclen, 2)
print("HLIT: %s [%d]" % (hlit, hlitd))
print("HDIST: %s [%d]" % (hdist, hdistd))
print("HCLEN: %s [%d]" % (hclen, hclend))

I get the following output:

$ python tstdcmp.py
11101101
---> 11011110
10011101
---> 11011001
01111001
---> 10010111
01110000
---> 00000111
00011100
---> 11000001
11000111
---> 01111100
01110101
---> 01010111
10000111
---> 01111000
00000111
---> 01110000
11100000
---> 00001110
Final: 1101111011011001100101110000011111000001011111000101011101111000011100000
0001110
Bfinal: 1
Btype: 10
HLIT: 11110 [30]
HDIST: 11011 [27]
HCLEN: 0011 [3]

So the basic concept that I got from the RFC1951 [2], was that in a byte, it's from highest bit to lowest bit. i.e. 7 6 5 4 3 2 1 0 But for multiple bytes, i.e. AB CD, the bytes are swapped to CD AB.

So with the code, I determined that the first two bytes ED 9D should actually be 9D ED.

And in bits: 11011110 11011001

That would mean BFinal == 1 and BTYPE == 10 (so dynamic huffman encoding).

With the first three bits removed, I get: 11110 11011001

So from 3.2.7 from [2], I should have the following info of the next few sets of bits:


 5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
 5 Bits: HDIST, # of Distance codes - 1        (1 - 32)
 4 Bits: HCLEN, # of Code Length codes - 4     (4 - 19)

So, HLIT = 11110 [30] HDIST = 11011 [27] HCLEN = 0011 [3]

But from the above, the HLIT should be between 257 and 286 and HCLEN between 4 and 9; but I get HLIT = 30 and HCLEN = 3.

That said, the documentation in 3.2.7:

For even greater compactness, the code length sequences
themselves are compressed using a Huffman code.

So is this an example of further compressing with huffman codes? I had hoped to understand this RFC1951[2]; but it's terribly confusing.

I have seen some Youtube videos on DEFLATE compression; but they show mostly on the algorithm and not how the bytes are packed in a file.

Any help greatly appreciated.

Thanks

[0] - https://learn.microsoft.com/en-us/openspecs/exchange_server_protocols/ms-mci/27f0a9bf-9567-4e40-ad66-6ae9ab9d2786

[1] - How to decompress MSZIP files with c#?

[2] - https://datatracker.ietf.org/doc/html/rfc1951

[3] - https://www.nayuki.io/page/simple-deflate-decompressor

Mark Adler · Accepted Answer · 2021-08-07T20:12:58.157

No, the "basic concept" you thought you got from RFC 1951 is completely wrong. First off, the bits in each byte are read from least significant to most significant. Second, you do not reverse the bytes in the stream. The first eight bits are first read from the first byte, the second eight bits from the second byte, and so on. (Lengths in stored blocks are stored little-endian, but that is neither reversed nor not reversed. It is simply how a 16-bit length is serialized in the byte stream.)
Once you read the bits correctly, the HLIT, etc. values are stored exactly as stated in the RFC. The first five bits are the number of literal/length codes minus 257. So you take the value of the five bits, which gives a number in 0..31, and add 257 to that. That gives a number in the range 257..288. The allowed range is actually 257..286 as noted on that same line, so the last two possible values of the five bits, 30 and 31, should not appear in a valid deflate stream.
RFC 1951 is not confusing at all. It is a clear and complete description of the format. However you need to have sufficient background in compression, in particular Huffman codes, to understand it. The RFC was not intended to be a textbook on compression, nor a textbook on how integers are coded in bits.
It is clear it would take you some time to figure this all out. Fortunately, you do not need to write your own inflator. You can instead use zlib. Read the documentation in zlib.h for all of the inflate functions.
In CAB files, MSZIP CFDATA blocks use the history from previous CFDATA blocks, until a folder boundary is reached. Even though each block is a properly terminated deflate stream, the next block can refer to uncompressed data from the previous block. To process CFDATA blocks after the first, you will need to use the inflateResetKeep() function of zlib to restart the inflate process while retaining the dictionary from the previous inflate operation.

For reference, here is a decoding of the initial bytes of the deflate stream you provided, using infgen:

! infgen 2.5 output
!
last            ! 1
dynamic         ! 10
count 286 30 16     ! 1100 11101 11101
code 16 4       ! 100
code 17 7       ! 111
code 0 4        ! 100 000
code 8 3        ! 011
code 7 4        ! 100
code 9 3        ! 011
code 6 4        ! 100
code 10 3       ! 011
code 5 4        ! 100
code 11 3       ! 011
code 4 5        ! 101
code 12 3       ! 011
code 3 7        ! 111
code 2 6        ! 110 000
lens 6          ! 1100
lens 8          ! 000
lens 8          ! 000
lens 9          ! 001
repeat 6        ! 11 1110
lens 9          ! 001
lens 8          ! 000
lens 10         ! 010
lens 9          ! 001
lens 9          ! 001
lens 9          ! 001
lens 8          ! 000
repeat 6        ! 11 1110
repeat 4        ! 01 1110
lens 9          ! 001
lens 9          ! 001
lens 10         ! 010
lens 10         ! 010
lens 10         ! 010
lens 8          ! 000
lens 9          ! 001
repeat 3        ! 00 1110
lens 10         ! 010
lens 9          ! 001
lens 10         ! 010
lens 10         ! 010
lens 9          ! 001
lens 10         ! 010
lens 9          ! 001
lens 10         ! 010
lens 9          ! 001
lens 8          ! 000
lens 9          ! 001
lens 8          ! 000
lens 8          ! 000
lens 7          ! 1101
lens 8          ! 000
lens 8          ! 000
lens 9          ! 001
lens 8          ! 000
lens 10         ! 010
lens 7          ! 1101
lens 9          ! 001
lens 8          ! 000
lens 12         ! 100
lens 9          ! 001
lens 10         ! 010
lens 10         ! 010
lens 10         ! 010
lens 8          ! 000
lens 7          ! 1101
repeat 4        ! 01 1110
lens 8          ! 000
lens 8          ! 000
lens 8          ! 000
lens 7          ! 1101
lens 9          ! 001

Thanks for pointing out where I went wrong. I'm sorry for the dunce-like confusion. Just to clarify, in my example, ```ED```(which is 11101101) is actually 10110111? — ewokx, Aug 08 '21 at 14:33
No. `ED` is always `11101101`. However you start reading bits from the _bottom_ (least significant), not the top. That way you get the bits in the correct order when two or more bits represent a value. That byte is broken up as `11101`, `10`, `1`. Where the first thing is that `1` on the bottom. That is the "last" bit, indicating that this first block is also the last block in the deflate stream. Then the next one up is `10` indicating a dynamic code block. The next one up is `11101` = 29. 29 + 257 = 286, indicating that there are 286 literal/length code lengths in the header for this block. — Mark Adler, Aug 08 '21 at 16:54
Note my points 4 and 5 in the answer. Just use zlib and the `inflateResetKeep()` function to extract MSZIP compressed data in a CAB file. — Mark Adler, Aug 08 '21 at 16:56
thanks for the explanation. Python's zlib module doesn't have inflateResetKeep() so I'll just attempt to read the zlib code (I *really* am bad at C unfortunately). And that ```11101``` ```10``` and ```1``` way of dissecting a byte requires a bit of time to digest. Again, Thanks!! — ewokx, Aug 09 '21 at 09:48
Python's `decompressobj()` has an optional dictionary parameter that you can use for this. Just keep the last 32K (or all of it if less than 32K) of the uncompressed data from the previous CFDATA block and feed that as the dictionary for the next block. — Mark Adler, Aug 09 '21 at 18:50
I'm soooo very grateful!! That worked like a charm!! Thank you very very much for the help! — ewokx, Aug 10 '21 at 01:52

How do I decompress a MSZIP block?

1 Answers1