
I'm trying to process a large gzip file pulled from the internet in Python using urllib2 and zlib, with techniques from these two Stack Overflow questions:

This works great, except that after each chunk of the file is read, I need to do some operations on the resulting string, which involve a lot of splitting and iterating. This takes some time, and when the code goes to do the next req.read(), it returns nothing and the program ends, having only read the first chunk.

If I comment out the other operations, the whole file is read and decompressed. Code:

import urllib2
import zlib

# 16 + MAX_WBITS tells zlib to expect (and skip) the gzip header
d = zlib.decompressobj(16 + zlib.MAX_WBITS)
CHUNK = 16 * 1024
url = 'http://foo.bar/foo.gz'
req = urllib2.urlopen(url)
while True:
    chunk = req.read(CHUNK)
    if not chunk:
        print "DONE"
        break
    s = d.decompress(chunk)
    # ...
    # lots of operations with s
    # which might take a while
    # but not more than 1-2 seconds

Any ideas?

Edit: This turned out to be a bug elsewhere in the program, NOT in the urllib2/zlib handling. Thanks to everyone who helped. I can recommend the pattern used in the code above if you need to handle large gzip files.

beerbajay

2 Answers


If timing out is the issue (and it's not clear that it is), you could decouple the reading and processing sides of your code by putting a queue between them and doing the processing in another thread that reads from the queue.

You could also make your chunk size smaller and do less processing per loop.
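Here's a minimal sketch of that queue-plus-worker-thread arrangement, reusing the url, CHUNK, and decompression setup from your question; the bounded queue size, the worker function name, and the None sentinel are just illustrative choices, not anything from your original code:

import threading
import urllib2
import zlib
import Queue

CHUNK = 16 * 1024
url = 'http://foo.bar/foo.gz'

q = Queue.Queue(maxsize=32)      # bounded so the reader can't race too far ahead

def worker():
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = q.get()
        if chunk is None:        # sentinel: the reader is done
            q.task_done()
            break
        s = d.decompress(chunk)
        # ... lots of operations with s, now off the download loop ...
        q.task_done()

t = threading.Thread(target=worker)
t.daemon = True
t.start()

req = urllib2.urlopen(url)
while True:
    chunk = req.read(CHUNK)
    if not chunk:
        q.put(None)              # tell the worker there's nothing more coming
        break
    q.put(chunk)                 # blocks if the worker falls behind

q.join()
print "DONE"

With a bounded queue the download side simply blocks when the worker falls behind, so memory use stays flat even if the per-chunk processing takes a second or two.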

Tavis Rudd

This turned out to be a bug elsewhere in the program, NOT in the urllib2/zlib handling. I can recommend the pattern used in the code above if you need to handle large gzip files.

beerbajay