I found a way to do streaming reads in Python in the top-voted answer of this post:

Stream large binary files with urllib2 to file.

But it goes wrong: I only get the front portion of the data when I do a time-consuming task after each chunk has been read.

from urllib2 import urlopen
from urllib2 import HTTPError

import sys
import time

CHUNK = 1024 * 1024 * 16


try:
    response = urlopen("XXX_domain/XXX_file_in_net.gz")
except HTTPError as e:
    print e
    sys.exit(1)

while True:
    chunk = response.read(CHUNK)

    print 'CHUNK:', len(chunk)

    # some time-consuming work, just as an example
    time.sleep(60)

    if not chunk:
        break

Without the sleep, the output is correct (the chunk sizes sum to the actual file size):

    CHUNK: 16777216
    CHUNK: 16777216
    CHUNK: 6888014
    CHUNK: 0

With the sleep:

    CHUNK: 16777216
    CHUNK: 766580
    CHUNK: 0

And when I decompress these chunks, I find that only the front portion of the .gz file was actually read.
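For what it's worth, the chunk-wise decompression step can be checked locally against an in-memory gzip stream, independent of the network. This is a minimal sketch (written for Python 3; the payload and chunk size are made up):

```python
import gzip
import io
import zlib

# Hypothetical in-memory gzip "file" standing in for the download.
original = b"some repetitive payload " * 4096
compressed = gzip.compress(original)

# Streaming decompressor; wbits = zlib.MAX_WBITS | 16 (i.e. 31)
# tells zlib to expect the gzip container format.
decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
stream = io.BytesIO(compressed)

recovered = bytearray()
while True:
    chunk = stream.read(4096)
    if not chunk:
        break
    recovered.extend(decomp.decompress(chunk))
recovered.extend(decomp.flush())

assert bytes(recovered) == original
```

If the assertion passes locally but the downloaded file still comes out truncated, the data loss is happening on the network side, not in the decompression.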


1 Answer

Try supporting resumable downloads (HTTP Range requests), in case the server closes the connection before sending all the data.

    import socket
    import sys
    from urllib2 import Request, urlopen, HTTPError

    the_url = "XXX_domain/XXX_file_in_net.gz"
    CHUNK = 16 * 1024 * 1024

    content_size = 0
    handled_size = 0

    try:
        request = Request(the_url, headers={'Range': 'bytes=0-'})
        response = urlopen(request, timeout=60)
    except HTTPError as e:
        print e
        sys.exit(1)

    header_dict = dict(response.info())
    print header_dict

    if 'content-length' in header_dict:
        content_size = int(header_dict['content-length'])

    while True:
        while True:
            try:
                chunk = response.read(CHUNK)
            except socket.timeout:
                print 'time_out'
                break

            if not chunk:
                break

            DoSomeTimeConsumingJob()

            handled_size += len(chunk)

        if handled_size == content_size and content_size != 0:
            break
        else:
            # re-open the connection, resuming from the last handled byte
            try:
                request = Request(the_url,
                                  headers={'Range': 'bytes=%d-' % handled_size})
                response = urlopen(request, timeout=60)
            except HTTPError as e:
                print e

    response.close()
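The resume loop above can be exercised without any network by substituting a byte buffer for the server. In this sketch (Python 3), `fetch_range` is a made-up stub playing the role of one read on a ranged response, and each pass of the outer loop mimics a connection that dies after delivering at most 10,000 bytes:

```python
# Hypothetical in-memory stand-in for the remote file.
payload = bytes(range(256)) * 1000  # 256,000 bytes of known content

def fetch_range(start, max_bytes):
    # Plays the role of one read on a ranged response: returns at most
    # max_bytes starting at offset `start` (stub, not a real request).
    return payload[start:start + max_bytes]

CHUNK = 4096
handled = bytearray()

while len(handled) < len(payload):
    # One "connection": delivers at most 10,000 bytes, then closes.
    budget = 10000
    while budget > 0:
        chunk = fetch_range(len(handled), min(CHUNK, budget))
        if not chunk:
            break
        handled.extend(chunk)
        budget -= len(chunk)
    # The next pass re-requests from len(handled), which is what the
    # header 'Range: bytes=%d-' % len(handled) expresses.

assert bytes(handled) == payload
```

The key invariant is that `handled` only ever grows by bytes that were actually read, so the resume offset is always correct no matter where the connection was cut.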