I'd like to download a compressed file (gzip or bzip2), decompress it, and analyze its contents while the download is still in progress, so that I can show partial results before the download finishes. It's a CSV-like file with lots of data; I compute sums, averages, and the like for certain columns. The compressed file is big (4 GB) and the decompressed stream is even bigger, so I don't want to keep the whole compressed file on disk or in memory.
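To make the aggregation concrete, here's the kind of running per-column computation I mean, sketched on a few in-memory rows (the column index and values are made up for illustration):

```python
import csv
import io

# A few tab-separated rows standing in for the real decompressed stream.
rows = csv.reader(io.StringIO("a\t10\nb\t20\nc\t30\n"), delimiter="\t")

total = 0.0
count = 0
for row in rows:
    total += float(row[1])      # column of interest (assumed numeric)
    count += 1
    partial_avg = total / count  # a partial result is available after every row
```

The point is that each row can be folded into the running totals as soon as it arrives, without ever holding the full file.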
I thought it would be possible to combine Python's gzip or bz2 modules with urllib2:
data_stream = csv.reader(
    gzip.GzipFile(
        fileobj=urllib2.urlopen('http://…/somefile.gz')),
    delimiter='\t')
…but it seems that the object urlopen returns is not file-like enough for GzipFile. Trying to read from the stream raises:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/gzip.py", line 450, in readline
c = self.read(readsize)
File "/usr/lib/python2.7/gzip.py", line 256, in read
self._read(readsize)
File "/usr/lib/python2.7/gzip.py", line 283, in _read
pos = self.fileobj.tell() # Save current position
AttributeError: addinfourl instance has no attribute 'tell'
The bz2 module is even worse: BZ2File doesn't accept a file object at all, only a filename.
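For context, the chunk-by-chunk behaviour I'm after does exist at a lower level: zlib.decompressobj accepts compressed bytes in arbitrary pieces. A minimal, self-contained sketch of the pattern, with in-memory data standing in for the network download (the 16 + MAX_WBITS argument tells zlib to expect a gzip header and trailer):

```python
import gzip
import io
import zlib

# Build a small gzip-compressed CSV in memory to stand in for the download.
raw = b"a\t1\nb\t2\nc\t3\n"
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(raw)
compressed = buf.getvalue()

# 16 + MAX_WBITS makes the decompressor expect a gzip wrapper.
decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)

# Feed the compressed bytes in small chunks, as they would arrive from
# the socket, accumulating whatever decompressed data is ready so far.
result = b""
for i in range(0, len(compressed), 8):
    result += decomp.decompress(compressed[i:i + 8])
result += decomp.flush()
# result now equals raw
```

Whether this can be wired cleanly into csv.reader on a live urllib2 response is exactly what I'm unsure about.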
After searching for answers, I found this question. The answer there works by essentially storing the whole compressed file in memory, which is infeasible for me.
What can I do?