I'd like to download a compressed file (gzip or bzip2), decompress it, and analyze its contents while the download is still in progress, so that I can show partial results before the download finishes. It's a CSV-like file with lots of data; I compute sums, averages, and the like for certain columns. The compressed file is big (4 GB) and the decompressed stream is even bigger, so I don't want to keep the whole compressed file on disk or in memory.
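To make the aggregation concrete, here's the kind of running per-column computation I mean, sketched on a few in-memory rows (the column index and values are made up for illustration):

```python
import csv
import io

# A few tab-separated rows standing in for the real decompressed stream.
rows = csv.reader(io.StringIO("a\t10\nb\t20\nc\t30\n"), delimiter="\t")

total = 0.0
count = 0
for row in rows:
    total += float(row[1])      # column of interest (assumed numeric)
    count += 1
    partial_avg = total / count  # a partial result is available after every row
```

The point is that each row can be folded into the running totals as soon as it arrives, without ever holding the full file.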
I thought it would be possible to combine Python's gzip or bz2 modules with urllib2:
data_stream = csv.reader(
    gzip.GzipFile(
        fileobj=urllib2.urlopen('http://…/somefile.gz')),
    delimiter='\t')
…but it seems that the object urlopen returns is not file-like enough for GzipFile. Trying to read from the stream raises:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/gzip.py", line 450, in readline
c = self.read(readsize)
File "/usr/lib/python2.7/gzip.py", line 256, in read
self._read(readsize)
File "/usr/lib/python2.7/gzip.py", line 283, in _read
pos = self.fileobj.tell() # Save current position
AttributeError: addinfourl instance has no attribute 'tell'
The bz2 module is even worse: BZ2File doesn't accept a file object at all, only a filename.
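For context, the chunk-by-chunk behaviour I'm after does exist at a lower level: zlib.decompressobj accepts compressed bytes in arbitrary pieces. A minimal, self-contained sketch of the pattern, with in-memory data standing in for the network download (the 16 + MAX_WBITS argument tells zlib to expect a gzip header and trailer):

```python
import gzip
import io
import zlib

# Build a small gzip-compressed CSV in memory to stand in for the download.
raw = b"a\t1\nb\t2\nc\t3\n"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(raw)
compressed = buf.getvalue()

# 16 + MAX_WBITS makes the decompressor expect a gzip wrapper.
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)

# Feed the compressed bytes in small chunks, as they would arrive from
# the socket, accumulating whatever decompressed data is ready so far.
result = b""
for i in range(0, len(compressed), 8):
    result += decomp.decompress(compressed[i:i + 8])
result += decomp.flush()
# result now equals raw
```

Whether this can be wired cleanly into csv.reader on a live urllib2 response is exactly what I'm unsure about.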
After searching for answers, I found this question. The answer there works by essentially storing the whole compressed file in memory, which is infeasible for me.
What can I do?