
I am moving to Python from another language and I am not sure how to properly tackle this. Using the urllib2 library it is quite easy to set up a proxy and get data from a site:

import urllib2

req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
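
For reference, the proxy part of my setup looks roughly like this, continuing from the snippet above (the proxy address is just a placeholder):

proxy_handler = urllib2.ProxyHandler({'http': 'http://myproxy.example.com:8080'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)  # subsequent urlopen() calls go through the proxy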

The problem I have is that the text file being retrieved is very large (hundreds of MB) and the connection is often problematic. The code also needs to catch connection, server and transfer errors (it will be part of a small, extensively used pipeline).

Could anyone suggest how to modify the code above so that it automatically reconnects up to n times (for example 100 times) and perhaps splits the response into chunks, so the data is downloaded faster and more reliably?

I have already split the requests as much as I could, so now I have to make sure that the retrieval code is as good as it can be. Solutions based on core Python libraries are ideal.

Perhaps the library is already doing the above, in which case: is there any way to improve the downloading of large files? I am using UNIX and need to deal with a proxy.

Thanks for your help.

  • If you don't mind a solution that uses an external library, this is a duplicate of http://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py – Lie Ryan Mar 21 '15 at 03:38
  • Thanks Lie Ryan, I will have a look, although I do prefer to do it with core libraries so I don't have to force people to install anything extra. –  Mar 21 '15 at 14:11

2 Answers


I'm putting up an example of how you might do this with the python-requests library. The script below checks whether the destination file already exists. If it does, it is assumed to be a partially downloaded file, and the script tries to resume the download. If the server claims to support HTTP partial requests (i.e. the response to a HEAD request contains an Accept-Ranges header), then the script resumes based on the size of the partially downloaded file; otherwise it just does a regular download and discards the parts that are already downloaded. I think it should be fairly straightforward to convert this to use just urllib2 if you don't want to use python-requests; it will probably just be much more verbose (I've put a rough sketch of that conversion after the script).

Note that resuming a download may corrupt the file if the file on the server is modified between the initial download and the resume. This can be detected if the server supports a strong HTTP ETag, so the downloader can check whether it is resuming the same file.
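
As a sketch of that idea (assuming the server actually sends an ETag; saved_etag here is hypothetical and would have to be remembered from the first response):

import requests

def resume_if_unchanged(url, output_stream, saved_etag, chunk_size=5*1024):
    # Ask for everything from the current end of the partial file onwards,
    # but only if the file on the server still matches the remembered ETag.
    headers = {'Range': 'bytes=%d-' % output_stream.tell()}
    if saved_etag:
        headers['If-Range'] = saved_etag  # server ignores Range if the ETag no longer matches
    resp = requests.get(url, stream=True, headers=headers)
    if resp.status_code != requests.codes.partial_content:
        # The file changed (or Range is not supported): the server is sending
        # the whole file again, so start the local copy from scratch.
        output_stream.seek(0)
        output_stream.truncate()
    return resp.iter_content(chunk_size)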

I make no claim that it is bug-free. You should probably add checksum logic around this script to detect download errors and retry from scratch if the checksum doesn't match.
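
For instance, something along these lines with hashlib from the standard library (the expected digest would have to come from somewhere else, such as a published checksum file):

import hashlib

def verify_checksum(path, expected_sha256):
    # Hash the finished file in chunks so huge files never need to fit in memory.
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(64 * 1024), b''):
            digest.update(block)
    return digest.hexdigest() == expected_sha256

With that caveat noted, here is the script: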

import logging
import os
import re
import requests

CHUNK_SIZE = 5*1024 # 5KB
logging.basicConfig(level=logging.INFO)

def stream_download(input_iterator, output_stream):
    for chunk in input_iterator:
        output_stream.write(chunk)

def skip(input_iterator, output_stream, bytes_to_skip):
    # Discard the first `bytes_to_skip` bytes from the iterator, then write out
    # the remainder of the chunk that crosses the boundary.
    total_read = 0
    while total_read <= bytes_to_skip:
        chunk = next(input_iterator)
        total_read += len(chunk)
    output_stream.write(chunk[bytes_to_skip - total_read:])
    assert total_read == output_stream.tell()
    return input_iterator

def resume_with_range(url, output_stream):
    dest_size = output_stream.tell()
    headers = {'Range': 'bytes=%s-' % dest_size}
    resp = requests.get(url, stream=True, headers=headers)
    input_iterator = resp.iter_content(CHUNK_SIZE)
    if resp.status_code != requests.codes.partial_content:
        logging.warn('server does not agree to do partial request, skipping instead')
        input_iterator = skip(input_iterator, output_stream, output_stream.tell())
        return input_iterator
    rng_unit, rng_start, rng_end, rng_size = re.match(r'(\w+) (\d+)-(\d+)/(\d+|\*)', resp.headers['Content-Range']).groups()
    rng_start, rng_end = int(rng_start), int(rng_end)  # rng_size may be '*', so leave it as a string
    assert rng_start <= dest_size
    if rng_start != dest_size:
        logging.warn('server returned different Range than requested')
        output_stream.seek(rng_start)
    return input_iterator

def download(url, dest):
    ''' Download `url` to `dest`, resuming if `dest` already exists
        If `dest` already exists it is assumed to be a partially 
        downloaded file for the url.
    '''
    output_stream = open(dest, 'ab+')

    output_stream.seek(0, os.SEEK_END)
    dest_size = output_stream.tell()

    if dest_size == 0:
        logging.info('STARTING download from %s to %s', url, dest)
        resp = requests.get(url, stream=True)
        input_iterator = resp.iter_content(CHUNK_SIZE)
        stream_download(input_iterator, output_stream)
        logging.info('FINISHED download from %s to %s', url, dest)
        return

    remote_headers = requests.head(url).headers
    remote_size = int(remote_headers['Content-Length'])
    if dest_size < remote_size:
        logging.info('RESUMING download from %s to %s', url, dest)
        support_range = 'bytes' in [s.strip() for s in remote_headers.get('Accept-Ranges', '').split(',')]
        if support_range:
            logging.debug('server supports Range request')
            logging.debug('downloading "Range: bytes=%s-"', dest_size)
            input_iterator = resume_with_range(url, output_stream)
        else:
            logging.debug('skipping %s bytes', dest_size)
            resp = requests.get(url, stream=True)
            input_iterator = resp.iter_content(CHUNK_SIZE)
            input_iterator = skip(input_iterator, output_stream, bytes_to_skip=dest_size)
        stream_download(input_iterator, output_stream)
        logging.info('FINISHED download from %s to %s', url, dest)
        return
    logging.debug('NOTHING TO DO')
    return

def main():
    TEST_URL = 'http://mirror.internode.on.net/pub/test/1meg.test'
    DEST = TEST_URL.split('/')[-1]
    download(TEST_URL, DEST)

main()
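
And since you said you prefer the core libraries: the Range request itself translates to urllib2 fairly directly. A rough sketch, without the HEAD check and bookkeeping the script above does:

import urllib2

def resume_with_urllib2(url, output_stream, chunk_size=5*1024):
    # Ask for everything from the current end of the partial file onwards.
    req = urllib2.Request(url, headers={'Range': 'bytes=%d-' % output_stream.tell()})
    resp = urllib2.urlopen(req)
    if resp.getcode() != 206:
        # The server ignored the Range header and is sending the whole file
        # again, so throw away the partial copy and start over.
        output_stream.seek(0)
        output_stream.truncate()
    while True:
        chunk = resp.read(chunk_size)
        if not chunk:
            break
        output_stream.write(chunk)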
Lie Ryan
  • Wow, very thorough code, big thanks. The requests library seems quite flexible, I will give it a go. –  Mar 22 '15 at 15:11

You can try something like this. It reads the response line by line and appends it to a file, and it checks to make sure that you don't write the same line twice. A second script that does it by chunks follows below.

import urllib2
file_checker = None
print("Please Wait...")
while True:
    try:
        req = urllib2.Request('http://www.voidspace.org.uk')
        response = urllib2.urlopen(req, timeout=20)
        print("Connected")
        with open("outfile.html", 'w+') as out_data:
            for data in response.readlines():
                file_checker = open("outfile.html")
                if data not in file_checker.readlines():
                    out_data.write(str(data))
        break
    except urllib2.URLError:
        print("Connection Error!")
        print("Connecting again...please wait")
file_checker.close()
print("done")

Here's how to read the data in chunks instead of by lines

import urllib2

CHUNK = 16 * 1024
file_checker = None
print("Please Wait...")
while True:
    try:
        req = urllib2.Request('http://www.voidspace.org.uk')
        response = urllib2.urlopen(req, timeout=1)
        print("Connected")
        with open("outdata", 'wb+') as out_data:
            while True:
                chunk = response.read(CHUNK)
                file_checker = open("outdata")
                if chunk and chunk not in file_checker.readlines():
                    out_data.write(chunk)
                else:
                    break
        break
    except urllib2.URLError:
        print("Connection Error!")
        print("Connecting again...please wait")
file_checker.close()
print("done")
reticentroot
  • With the way you check for duplicate chunks, this script is likely to corrupt your downloads if you have a file with lots of repeating structure. – Lie Ryan Mar 21 '15 at 03:20
  • You know, I didn't think about that. That's a good point! How would you go about doing it? I've been thinking about it for a while. – reticentroot Mar 21 '15 at 03:21
  • If the server supports HTTP Range request, I'd have sent that to skip having to redownload the already downloaded parts. Otherwise, if the server doesn't support Range for the request, I'd read the outfile's current file size and just throw away that much data, before continuing as usual. In either case, it's a good idea to do a checksum to ensure that bugs that cause file corruption can be detected. – Lie Ryan Mar 21 '15 at 03:27
  • Hi hrand, many thanks for your hard work! Two questions: (1) Why do I have to check for duplicate chunks? If I want to download the file exactly as is, can I just append everything without this check, or am I missing something? (2) If I understand the structure correctly, the script will keep reconnecting until all the data is completely downloaded? What if the server is down for some reason, would this not keep the script trying for however long the server is out? Would it not be a good idea to limit the number of failed attempts? –  Mar 21 '15 at 14:10
  • Yes, you're right. I wasn't sure whether the current chunk location would be maintained when the connection dropped, so I added that as a check. But drop it if it's redundant. You can limit failed attempts with a simple counter: set the counter to 0 and increment it if the connection fails, then instead of having the while loop set to True, loop while the counter is less than some number (see the sketch after these comments). – reticentroot Mar 21 '15 at 15:28
  • You're welcome. Also take note of @Lie Ryan: if the chunk location isn't maintained, the code may need to be modified, i.e. "I'd read the outfile's current file size and just throw away that much data, before continuing as usual." – reticentroot Mar 21 '15 at 15:37
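
A minimal sketch of that bounded-retry idea, using only the standard library and tying in the errors the question mentions (the retry limit and timeout values are arbitrary, and a real version would resume with a Range header rather than restart from scratch):

import socket
import urllib2

MAX_RETRIES = 100

def fetch_with_retries(url, dest):
    # Retry the whole download up to MAX_RETRIES times before giving up.
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            resp = urllib2.urlopen(urllib2.Request(url), timeout=20)
            with open(dest, 'wb') as out:
                while True:
                    chunk = resp.read(16 * 1024)
                    if not chunk:
                        break
                    out.write(chunk)
            return True
        except (urllib2.URLError, socket.timeout, socket.error) as err:
            # URLError covers connection and HTTP errors; the socket errors
            # cover timeouts and drops in the middle of a transfer.
            print("attempt %d/%d failed: %s" % (attempt, MAX_RETRIES, err))
    return False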