I'm trying to process a large gzip file pulled from the internet in Python, using urllib2 and zlib and techniques from these two Stack Overflow questions:
This works great, except that after each chunk of the file is read, I need to do some operations on the resulting string, which involve a lot of splitting and iterating. This takes some time, and when the code goes to do the next req.read(), it returns nothing and the program ends, having only read the first chunk.
If I comment out the other operations, the whole file is read and decompressed. Code:
import urllib2
import zlib

# 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
CHUNK = 16 * 1024

url = 'http://foo.bar/foo.gz'
req = urllib2.urlopen(url)

while True:
    chunk = req.read(CHUNK)
    if not chunk:
        print "DONE"
        break
    s = d.decompress(chunk)
    # ...
    # lots of operations with s
    # which might take a while
    # but not more than 1-2 seconds
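For context, the per-chunk work is roughly of this shape (a hypothetical sketch, not the real code: the delimiter and field handling are made up, but it is all string splitting and iterating):

def process_chunk(s, leftover):
    # Hypothetical per-chunk work: split the decompressed text into lines
    # and iterate over their fields. `leftover` carries the partial trailing
    # line from the previous chunk so records are not cut in half.
    lines = (leftover + s).split('\n')
    leftover = lines.pop()              # last piece may be an incomplete line
    for line in lines:
        fields = line.split('\t')       # assumed tab-separated records
        # ... do something with fields ...
    return leftover

Inside the loop above it would be called as leftover = process_chunk(s, leftover), with leftover = '' initialised before the loop.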
Any ideas?
Edit: This turned out to be a bug elsewhere in the program, NOT in the urllib2/zlib handling. Thanks to everyone who helped. I can recommend the pattern used in the code above if you need to handle large gzip files.
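If it helps anyone else, here is a minimal self-contained sketch of that pattern wrapped as a generator (the URL is a placeholder and iter_gzip_chunks is just a name I made up):

import urllib2
import zlib

CHUNK = 16 * 1024

def iter_gzip_chunks(url):
    # Yield decompressed pieces of a remote gzip file without ever
    # holding the whole file in memory.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    req = urllib2.urlopen(url)
    while True:
        chunk = req.read(CHUNK)
        if not chunk:
            break
        yield d.decompress(chunk)
    yield d.flush()   # emit whatever is still buffered in the decompressor

for s in iter_gzip_chunks('http://foo.bar/foo.gz'):
    pass  # process s here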