I downloaded a .gz file and decompressed it successfully with 'gzip -d'. But when I tried to decompress it chunk by chunk with Python's zlib, something went wrong.

import zlib

CHUNK = 1024 * 1024
infile = open('2019-07-06-13.log.gz')
# 32 + MAX_WBITS tells zlib to auto-detect the zlib/gzip header
d = zlib.decompressobj(32 + zlib.MAX_WBITS)
while True:
    chunk = infile.read(CHUNK)
    if not chunk:
        break
    data = d.decompress(chunk)
    print len(chunk), len(data)
print "#####"

Since the file is small, this loop runs only once. The printed result shows that len(data) is smaller than len(chunk), which is certainly wrong: decompressing a text log should yield more bytes than the compressed input.

The output:

100576 50389
#####

Meanwhile, I used gzip -c to recompress the file I had decompressed with gzip -d, and ran my code on that recompressed file. This time the resulting lengths came out right, which means my code works fine on a normal .gz file.

hunter_tech
  • Windows? You need `rb` mode: `infile = open('2019-07-06-13.log.gz','rb')`. If you were using Python 3 you would have noticed earlier. – Jean-François Fabre Jul 06 '19 at 08:50
  • No, I ran the code in Linux using Python 2. – hunter_tech Jul 06 '19 at 08:58
  • Isn’t the rest still [in the decompression object](https://docs.python.org/2.7/library/zlib.html#zlib.Decompress.unconsumed_tail)? – Davis Herring Jul 09 '19 at 06:48
  • @DavisHerring The file I used for testing is very small, so the loop runs only once. – hunter_tech Jul 09 '19 at 07:03
  • @hunter_tech: You may have *read* the entire file, but that doesn’t mean it’s all been *decompressed* with the one call. – Davis Herring Jul 09 '19 at 14:10
  • @DavisHerring, if so, what to do next? – hunter_tech Jul 10 '19 at 02:28
  • @DavisHerring, thanks! You gave the key hint for the solution. The problem is that the original gz file is a concatenation of many gzip sub-files, which makes its decompression a little tricky. – hunter_tech Jul 10 '19 at 03:28
  • @hunter_tech: It seems that, after each read, you need to loop passing `unconsumed_tail` to `decompress`. I’m not certain, though; that interface seems confusing and error-prone. – Davis Herring Jul 10 '19 at 03:30
  • @DavisHerring, you're very close to the answer. I've updated it in the post. You should have posted a reply and earned some votes as a reward. – hunter_tech Jul 10 '19 at 08:57
  • @hunter_tech: Well, I wasn’t quite right, and I could have written an answer after you verified it. (And it’s good not to obsess over the reputation.) But now you should write an answer—not edit it into the question. – Davis Herring Jul 10 '19 at 13:16
  • @DavisHerring, that's one more piece of good advice for me! Much appreciated. For a beginner, reputation is critical for getting past all the endless privilege limits on this site... – hunter_tech Jul 11 '19 at 01:46
  • @hunter_tech: The point of the reputation requirements is not to make you want reputation as quickly as possible, but to make sure you’ve learned how to do things correctly before you try them (to avoid messes and noise). – Davis Herring Jul 11 '19 at 04:17
  • @DavisHerring Yeah, a long and necessary road to travel. – hunter_tech Jul 11 '19 at 06:52

2 Answers


Thanks for the hint from DavisHerring! The key problem is that the original .gz file is a concatenation of multiple gzip sub-files, which makes its decompression a little more complex.
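
To see what is going on, here is a minimal sketch (my own demonstration, which builds a two-member gzip stream in memory instead of using the real log file): a single decompress object stops at the end of the first member and leaves everything after it in unused_data.

import gzip
import zlib
from StringIO import StringIO

# Build a two-member gzip stream in memory; gzip members may be
# concatenated, and `gzip -d` decodes all of them as one file.
buf = StringIO()
for text in ('first member ', 'second member'):
    g = gzip.GzipFile(fileobj=buf, mode='wb')
    g.write(text)
    g.close()
blob = buf.getvalue()

d = zlib.decompressobj(32 + zlib.MAX_WBITS)
print repr(d.decompress(blob))  # only 'first member ' comes out
print len(d.unused_data)        # the second member is still in here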

Here's the solution:

import zlib

CHUNK = 1024 * 1024
infile = open('2019-07-06-13.log.gz', 'rb')
d = zlib.decompressobj(32 + zlib.MAX_WBITS)

while True:
    chunk = infile.read(CHUNK)
    if not chunk:
        break

    data = d.decompress(chunk)
    print len(chunk), len(data)

    # When a gzip member ends, the rest of the chunk is left in
    # unused_data. Start a fresh decompress object for each new member
    # so that its header is parsed correctly.
    while d.unused_data != '':
        buf = d.unused_data
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        data = d.decompress(buf)
        print len(buf), len(data)
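
One extra check that may be worth adding (my own note, using the documented Decompress.flush() API): after the outer loop finishes, flush the last object so that any output still buffered inside it is not lost.

# after the outer while loop ends
data = d.flush()
print len(data)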
hunter_tech
  • @DavisHerring, the code is intended to re-initialize the decompressobj repeatedly, so as to pass the header check of each gzip sub-file. – hunter_tech Jul 11 '19 at 06:47
  • @DavisHerring, you mean a chunk may contain content from two sub-files? – hunter_tech Jul 11 '19 at 10:48
  • Yes, or not all of one. – Davis Herring Jul 11 '19 at 13:12
  • @DavisHerring, that will be OK. Each gzip sub-file needs exactly one decompressobj initialization; that is the rule, and my code doesn't break it. If the last remaining part of a compressed chunk belongs to a different sub-file, that remainder is decompressed normally, because the decompressobj was re-initialized in the unused-data loop once the previous sub-file finished. The rest of that sub-file, arriving in the next chunk, is then handled normally, because I don't re-initialize the object again. – hunter_tech Jul 11 '19 at 15:18
  • Fair enough—I think I was misled by the potential for partial *printing*, but that’s secondary. – Davis Herring Jul 11 '19 at 17:14
  • @DavisHerring, I have another question. Could you give me some advice again? https://stackoverflow.com/questions/57105535/streaming-reading-chunk-by-chunk-reading-using-python-urlib2-open-can-only-get – hunter_tech Jul 20 '19 at 02:22

The gzip format differs from the zlib format:

Why does gzip give an error on a file I make with compress/deflate?

The compress and deflate functions produce data in the zlib format, which is different from and incompatible with the gzip format. The gz* functions in zlib, on the other hand, use the gzip format. Both the zlib and gzip formats use the same compressed data format internally, but they have different headers and trailers around the compressed data.

Source: zlib.net
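
The difference is visible in the very first bytes of the output (a small illustration I added; the payload string is arbitrary):

import zlib

payload = 'hello world'

# zlib wrapper: two-byte header, typically starting with '\x78'
print repr(zlib.compress(payload)[:2])

# gzip wrapper: starts with the magic bytes '\x1f\x8b'
co = zlib.compressobj(6, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
print repr((co.compress(payload) + co.flush())[:2])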

For decompressing .gz files you should use the built-in gzip module.
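
For the chunked reading in the question, that could look like this (a minimal sketch; the file name is taken from the question). Note that gzip.GzipFile also handles files made of several concatenated members:

import gzip

CHUNK = 1024 * 1024

# GzipFile decodes the gzip framing itself, including multi-member
# files, so the loop only ever sees decompressed bytes.
infile = gzip.open('2019-07-06-13.log.gz', 'rb')
while True:
    data = infile.read(CHUNK)
    if not data:
        break
    print len(data)
infile.close()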

COOL_IRON
  • I've seen the description you posted. The compatibility problem can be fixed by having the decompressor auto-detect the gzip header when it is initialized: zlib.decompressobj(32 + zlib.MAX_WBITS) – hunter_tech Jul 09 '19 at 06:53
  • What I want is to download and decompress a very big file. For efficiency, I want to do streaming decompression entirely in memory while reading the file from the network. The example I posted is just meant to show the zlib problem. – hunter_tech Jul 09 '19 at 07:00