I found a way to do streaming reads in Python in the top-voted answer of this post:

Stream large binary files with urllib2 to file.

But it goes wrong: I only get the front portion of the data when I do a time-consuming task after each chunk has been read.

from urllib2 import urlopen
from urllib2 import HTTPError

import sys
import time

CHUNK = 1024 * 1024 * 16


try:
    response = urlopen("XXX_domain/XXX_file_in_net.gz")
except HTTPError as e:
    print e
    sys.exit(1)

while True:
    chunk = response.read(CHUNK)

    print 'CHUNK:', len(chunk)

    # some time-consuming work, just as an example
    time.sleep(60)

    if not chunk:
        break

Without the sleep, the output is correct (the chunk sizes sum to the actual file size):

    CHUNK: 16777216
    CHUNK: 16777216
    CHUNK: 6888014
    CHUNK: 0

With the sleep:

    CHUNK: 16777216
    CHUNK: 766580
    CHUNK: 0

And when I decompress these chunks, I find that only the front portion of the .gz file was actually read.
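For what it's worth, the chunk-wise decompression step can be checked locally against an in-memory gzip stream, independent of the network. This is a minimal sketch (written for Python 3; the payload and chunk size are made up):

```python
import gzip
import io
import zlib

# Hypothetical in-memory gzip "file" standing in for the download.
original = b"some repetitive payload " * 4096
compressed = gzip.compress(original)

# Streaming decompressor; wbits = zlib.MAX_WBITS | 16 (i.e. 31)
# tells zlib to expect the gzip container format.
decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
stream = io.BytesIO(compressed)

recovered = bytearray()
while True:
    chunk = stream.read(4096)
    if not chunk:
        break
    recovered.extend(decomp.decompress(chunk))
recovered.extend(decomp.flush())

assert bytes(recovered) == original
```

If the assertion passes locally but the downloaded file still comes out truncated, the data loss is happening on the network side, not in the decompression.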


1 Answer

Try supporting resumable downloads (HTTP Range requests), in case the server closes the connection before sending all the data.

    import socket
    import sys
    from urllib2 import Request, urlopen, HTTPError

    the_url = "XXX_domain/XXX_file_in_net.gz"
    CHUNK = 16 * 1024 * 1024

    content_size = 0
    handled_size = 0

    try:
        request = Request(the_url, headers={'Range': 'bytes=0-'})
        response = urlopen(request, timeout=60)
    except HTTPError as e:
        print e
        sys.exit(1)

    header_dict = dict(response.info())
    print header_dict

    if 'content-length' in header_dict:
        content_size = int(header_dict['content-length'])

    while True:
        while True:
            try:
                chunk = response.read(CHUNK)
            except socket.timeout:
                print 'time_out'
                break

            if not chunk:
                break

            DoSomeTimeConsumingJob()

            handled_size += len(chunk)

        if handled_size == content_size and content_size != 0:
            break
        else:
            # re-open the connection, resuming from the last handled byte
            try:
                request = Request(the_url,
                                  headers={'Range': 'bytes=%d-' % handled_size})
                response = urlopen(request, timeout=60)
            except HTTPError as e:
                print e

    response.close()
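The resume loop above can be exercised without any network by substituting a byte buffer for the server. In this sketch (Python 3), `fetch_range` is a made-up stub playing the role of one read on a ranged response, and each pass of the outer loop mimics a connection that dies after delivering at most 10,000 bytes:

```python
# Hypothetical in-memory stand-in for the remote file.
payload = bytes(range(256)) * 1000  # 256,000 bytes of known content

def fetch_range(start, max_bytes):
    # Plays the role of one read on a ranged response: returns at most
    # max_bytes starting at offset `start` (stub, not a real request).
    return payload[start:start + max_bytes]

CHUNK = 4096
handled = bytearray()

while len(handled) < len(payload):
    # One "connection": delivers at most 10,000 bytes, then closes.
    budget = 10000
    while budget > 0:
        chunk = fetch_range(len(handled), min(CHUNK, budget))
        if not chunk:
            break
        handled.extend(chunk)
        budget -= len(chunk)
    # The next pass re-requests from len(handled), which is what the
    # header 'Range: bytes=%d-' % len(handled) expresses.

assert bytes(handled) == payload
```

The key invariant is that `handled` only ever grows by bytes that were actually read, so the resume offset is always correct no matter where the connection was cut.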