I have a compressed file that is a CAB that I wish to extract a file from in Linux. Since there aren't any native CAB extractors on Linux, I figured I'd try my hand at getting one done.
While I've seen the MSZIP documentation[0] as well as [1] and [2], I'm having difficulties in decompressing it even given that each Block is compressed using a modified DEFLATE compressor. While I did use [3] to decompress the blocks, aside for the first block, the rest of the blocks had a lot of missing data [data which were completely nulled [i.e. 0x00 filled].
So I'm lead to believe I'll need to figure it out manually.
Unfortunately, I am having trouble understanding [2]. So, with the following data:
ED 9D 79 70 1C C7 75 87 07 E0 4D 88 24 EE 8B 94
04 1E A2 44 51 04 81 05 08 2D 11 4A C2 5E 00 96
...
And the following code to dissect the data:
#!/bin/env python
import os
def to_bin(in_item):
b = ord(chr(in_item))
retval = bin(b).replace("0b", "")
if len(retval) < 8:
lx = 8 - len(retval)
q = "0" * lx
retval = q + retval
return retval
def swapbytes(in_bit_str):
p = in_bit_str[:4]
q = in_bit_str[4:]
return q + p
pth = os.path.join("comp", "test_0.dat")
fp = open(pth, 'rb')
fpd = fp.read(10)
fp.close()
dx = ""
for item in fpd:
cx = to_bin(item)
scx = swapbytes(cx)
dx += scx
print(cx)
print("---> %s" % scx)
print("Final: %s" % dx)
bfinal = dx[0]
btype = dx[1:3]
print("Bfinal: %s" % bfinal)
print("Btype: %s" % btype)
rest = dx[3:]
hlit = rest[:5]
hdist = rest[5:11]
hclen = rest[11:15]
hlitd = int(hlit, 2)
hdistd = int(hdist, 2)
hclend = int(hclen, 2)
print("HLIT: %s [%d]" % (hlit, hlitd))
print("HDIST: %s [%d]" % (hdist, hdistd))
print("HCLEN: %s [%d]" % (hclen, hclend))
I get the following output:
$ python tstdcmp.py
11101101
---> 11011110
10011101
---> 11011001
01111001
---> 10010111
01110000
---> 00000111
00011100
---> 11000001
11000111
---> 01111100
01110101
---> 01010111
10000111
---> 01111000
00000111
---> 01110000
11100000
---> 00001110
Final: 1101111011011001100101110000011111000001011111000101011101111000011100000
0001110
Bfinal: 1
Btype: 10
HLIT: 11110 [30]
HDIST: 11011 [27]
HCLEN: 0011 [3]
So the basic concept that I got from the RFC1951 [2], was that in a byte, it's from highest bit to lowest bit. i.e. 7 6 5 4 3 2 1 0 But for multiple bytes, i.e. AB CD, the bytes are swapped to CD AB.
So with the code, I determined that the first two bytes ED 9D
should actually be 9D ED
.
And in bits:
11011110 11011001
That would mean BFinal == 1 and BTYPE == 10 (so dynamic huffman encoding).
With the first three bits removed, I get:
11110 11011001
So from 3.2.7 from [2], I should have the following info of the next few sets of bits:
5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
5 Bits: HDIST, # of Distance codes - 1 (1 - 32)
4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)
So, HLIT = 11110 [30] HDIST = 11011 [27] HCLEN = 0011 [3]
But from the above, the HLIT should be between 257 and 286 and HCLEN between 4 and 9; but I get HLIT = 30 and HCLEN = 3.
That said, the documentation in 3.2.7:
For even greater compactness, the code length sequences
themselves are compressed using a Huffman code.
So is this an example of further compressing with huffman codes? I had hoped to understand this RFC1951[2]; but it's terribly confusing.
I have seen some Youtube videos on DEFLATE compression; but they show mostly on the algorithm and not how the bytes are packed in a file.
Any help greatly appreciated.
Thanks
[1] - How to decompress MSZIP files with c#?
[2] - https://datatracker.ietf.org/doc/html/rfc1951
[3] - https://www.nayuki.io/page/simple-deflate-decompressor