
I'm trying to process a large gzip file pulled from the internet in Python using urllib2 and zlib, with techniques from these two Stack Overflow questions:

This works great, except that after each chunk of the file is read, I need to do some operations on the resulting string, which involve a lot of splitting and iterating. This takes some time, and when the code goes to do the next req.read(), it returns nothing and the program ends, having only read the first chunk.

If I comment out the other operations, the whole file is read and decompressed. Code:

import urllib2
import zlib

# 16 + MAX_WBITS tells zlib to expect (and skip) the gzip header
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
CHUNK = 16 * 1024
url = 'http://foo.bar/foo.gz'
req = urllib2.urlopen(url)
while True:
    chunk = req.read(CHUNK)
    if not chunk:
        print "DONE"
        break
    s = d.decompress(chunk)
    # ...
    # lots of operations with s
    # which might take a while
    # but not more than 1-2 seconds

Any ideas?

Edit: This turned out to be a bug elsewhere in the program, NOT in the urllib2/zlib handling. Thanks to everyone who helped. I can recommend the pattern used in the code above if you need to handle large gzip files.

beerbajay

2 Answers


If timing out is the issue (and it's not clear that it is), you could decouple the reading and processing sides of your code by putting a queue between them and doing the processing in another thread that reads from the queue.

You could also make your chunk size smaller and do less processing per loop.
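Here's a minimal sketch of that queue-plus-worker-thread arrangement, reusing the url, CHUNK, and decompression setup from your question; the bounded queue size, the worker function name, and the None sentinel are just illustrative choices, not anything from your original code:

import threading
import urllib2
import zlib
import Queue

CHUNK = 16 * 1024
url = 'http://foo.bar/foo.gz'

q = Queue.Queue(maxsize=32)      # bounded so the reader can't race too far ahead

def worker():
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = q.get()
        if chunk is None:        # sentinel: the reader is done
            q.task_done()
            break
        s = d.decompress(chunk)
        # ... lots of operations with s, now off the download loop ...
        q.task_done()

t = threading.Thread(target=worker)
t.daemon = True
t.start()

req = urllib2.urlopen(url)
while True:
    chunk = req.read(CHUNK)
    if not chunk:
        q.put(None)              # tell the worker there's nothing more coming
        break
    q.put(chunk)                 # blocks if the worker falls behind

q.join()
print "DONE"

With a bounded queue the download side simply blocks when the worker falls behind, so memory use stays flat even if the per-chunk processing takes a second or two.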

Tavis Rudd

This turned out to be a bug elsewhere in the program, NOT in the urllib2/zlib handling. I can recommend the pattern used in the code above if you need to handle large gzip files.

beerbajay