
I have a memory- and disk-limited environment where I need to decompress the contents of a gzip file sent to me in string-based chunks (over xmlrpc binary transfer). However, both zlib.decompress() and zlib.decompressobj().decompress() barf over the gzip header. I've tried offsetting past the gzip header (documented here), but still haven't managed to avoid the barf. The gzip library itself only seems to support decompressing from files.

The following snippet gives a simplified illustration of what I would like to do (except in real life the buffer will be filled from xmlrpc, rather than reading from a local file):

#! /usr/bin/env python

import zlib

CHUNKSIZE = 1000

d = zlib.decompressobj()

f = open('23046-8.txt.gz', 'rb')
buffer = f.read(CHUNKSIZE)

while buffer:
    outstr = d.decompress(buffer)
    print(outstr)
    buffer = f.read(CHUNKSIZE)

outstr = d.flush()
print(outstr)

f.close()

Unfortunately, as I said, this barfs with:

Traceback (most recent call last):
  File "./test.py", line 13, in <module>
    outstr = d.decompress(buffer)
zlib.error: Error -3 while decompressing: incorrect header check 

Theoretically, I could feed my xmlrpc-sourced data into a StringIO and then use that as a fileobj for gzip.GzipFile(), however, in real life, I don't have memory available to hold the entire file contents in memory as well as the decompressed data. I really do need to process it chunk-by-chunk.

The fall-back would be to change the compression of my xmlrpc-sourced data from gzip to plain zlib, but since that impacts other sub-systems I'd prefer to avoid it if possible.

Any ideas?

user291294

2 Answers


gzip and zlib use slightly different headers.

See How can I decompress a gzip stream with zlib?

Try d = zlib.decompressobj(16+zlib.MAX_WBITS).

And you might try changing your chunk size to a power of 2 (say CHUNKSIZE=1024) for possible performance reasons.
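A minimal sketch of the chunked approach with this wbits value (the payload here is generated in memory with gzip.compress purely for illustration, standing in for the xmlrpc-sourced chunks):

```python
import gzip
import io
import zlib

# Stand-in for the data arriving over xmlrpc: a gzip blob built in memory.
payload = b"hello chunked gzip world\n" * 100
gz_data = gzip.compress(payload)

# 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
# instead of the plain zlib wrapper.
d = zlib.decompressobj(16 + zlib.MAX_WBITS)

CHUNKSIZE = 1024
out = bytearray()
stream = io.BytesIO(gz_data)
chunk = stream.read(CHUNKSIZE)
while chunk:
    # Only one chunk is held in memory at a time.
    out += d.decompress(chunk)
    chunk = stream.read(CHUNKSIZE)
out += d.flush()

assert bytes(out) == payload
```

In real use you would replace the io.BytesIO reads with whatever hands you each xmlrpc chunk, and write the decompressed output to disk instead of accumulating it.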

wisty
  • That did it perfectly. Thanks. (Now, why isn't this hint in the python docs?) – user291294 Mar 11 '10 at 14:30
  • 4
    zlib is just a wrapper around the c version of zlib. It's not well documented at all. Mind you, the 16+zlib.MAX_WBITS isn't documented the c version either, and it's not the first time I've seen an undocumented zlib feature. – wisty Mar 12 '10 at 17:33
  • definitely needs to be documented! – Ross Oct 17 '11 at 03:48
  • This worked fine for me until yesterday. I have a gzipped file here that decompresses fine with command-line gzip, decompresses fine with the gzip module in Python, but stops prematurely with zlib. As noted elsewhere, gzip wants a real file (that it can seek() on), so I'm now in the market for an alternative gzip and/or zlib implementation. – izak Oct 07 '16 at 12:28
  • Thank you! This should be noted in some official docs. I fought with this for hours... – Jonathan R Jan 19 '19 at 23:06

I've got a more detailed answer here: https://stackoverflow.com/a/22310760/1733117

d = zlib.decompressobj(zlib.MAX_WBITS|32)

Per the documentation, this automatically detects the header type (zlib or gzip).
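A short sketch showing the auto-detection: the same decompressor construction handles both wrappers (the payload is made up for illustration):

```python
import gzip
import zlib

payload = b"auto-detect me\n" * 50

# MAX_WBITS | 32 enables automatic header detection, so the same
# call works whether the data is zlib-wrapped or gzip-wrapped.
for blob in (zlib.compress(payload), gzip.compress(payload)):
    d = zlib.decompressobj(zlib.MAX_WBITS | 32)
    assert d.decompress(blob) + d.flush() == payload
```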

dnozay
  • I don't want to downvote you, but this simply doesn't work for me. – jds Apr 24 '15 at 14:43
  • @gwg try to be more precise, e.g., what specifically does not work for you. Otherwise, people won't be able to help you. Thanks for your kind understanding. – pedjjj Jan 19 '20 at 15:43
  • I wasn't looking for help as the accepted answer worked for me. I was registering this opinion to save others time. – jds Jan 19 '20 at 16:33