
When downloading a large file with Python, I want to put a time limit not only on the connection, but also on the download itself.

I am trying the following Python code:

import requests

r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip', timeout=0.5, prefetch=False)

print r.headers['content-length']

print len(r.raw.read())

This does not work (the download is not time limited), as correctly noted in the docs: https://requests.readthedocs.org/en/latest/user/quickstart/#timeouts

It would be great if this were possible:

r.raw.read(timeout = 10)

The question is: how do I put a time limit on the download?

Hristo Hristov
  • I'm not advocating this as the best solution, but here's a general solution for putting time limits on function calls using signals: http://stackoverflow.com/a/601168/471671 . It's a kludge and I don't recommend using it unless a more elegant solution is not available. – Andrew Gorcester Nov 26 '12 at 21:11
  • Yes, signals are not an option because of http://stackoverflow.com/a/1114567/389463 – Hristo Hristov Nov 26 '12 at 21:16
  • Now you have a timeout parameter in `requests` :) See https://www.hausarztpraxis-altburg.de – DaveFar Nov 29 '22 at 12:37
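
For reference, later requests versions replaced prefetch=False with stream=True; a minimal sketch of an overall download deadline built on top of that (the 5-second budget and 4 KiB chunk size below are arbitrary choices for illustration, not anything prescribed by the library) might look like this:

import time
import requests

URL = 'http://ipv4.download.thinkbroadband.com/1GB.zip'
DEADLINE = 5  # overall wall-clock budget in seconds (arbitrary for this sketch)

# stream=True defers the body download, like prefetch=False did;
# timeout= only bounds the connect and individual socket reads, not the whole transfer.
r = requests.get(URL, stream=True, timeout=3)

start = time.time()
data = bytearray()
for chunk in r.iter_content(chunk_size=4096):
    data.extend(chunk)
    if time.time() - start > DEADLINE:
        r.close()  # give up on the rest of the body
        break

print "Read %d of %s expected bytes." % (len(data), r.headers['content-length'])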

3 Answers


And the answer is: do not use requests, as it is blocking. Use non-blocking network I/O, for example eventlet:

import eventlet
from eventlet.green import urllib2
from eventlet.timeout import Timeout

url5 = 'http://ipv4.download.thinkbroadband.com/5MB.zip'
url10 = 'http://ipv4.download.thinkbroadband.com/10MB.zip'

urls = [url5, url5, url10, url10, url10, url5, url5]

def fetch(url):
    response = bytearray()
    with Timeout(60, False):  # if 60 seconds pass, silently abandon the block
        response = urllib2.urlopen(url).read()
    return url, len(response)

pool = eventlet.GreenPool()
for url, length in pool.imap(fetch, urls):
    if not length:
        print "%s: timeout!" % url
    else:
        print "%s: %s" % (url, length)

This produces the expected results:

http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/10MB.zip: timeout!
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
http://ipv4.download.thinkbroadband.com/5MB.zip: 5242880
Hristo Hristov
  • Have you seen [GRequests: Asynchronous Requests](https://github.com/kennethreitz/grequests)? – Piotr Dobrogost Nov 27 '12 at 21:06
  • With this code, what happens when the timeout triggers? :) What guarantees do you have as to the state of a socket? – Piotr Dobrogost Nov 27 '12 at 22:31
  • AFAIK, there is no threading here, still operations are running in parallel. When timeout triggers, the non-blocking operation in progress is just cancelled. No killing. The socket is closed. I hope ;) – Hristo Hristov Nov 28 '12 at 05:58
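
Regarding the GRequests/gevent suggestion in the comments: a rough sketch of the same timeout idea using gevent directly, with requests made cooperative by monkey-patching the socket module (the 10-second limit is an arbitrary choice for this sketch):

import gevent.monkey
gevent.monkey.patch_all()  # make socket I/O cooperative so Timeout can interrupt it

import gevent
import requests

url = 'http://ipv4.download.thinkbroadband.com/10MB.zip'

body = None
with gevent.Timeout(10, False):  # False: swallow the timeout instead of raising
    body = requests.get(url).content

if body is None:
    print "%s: timeout!" % url
else:
    print "%s: %s" % (url, len(body))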

When using Requests' prefetch=False parameter, you get to pull in arbitrary-sized chunks of the response at a time (rather than all at once).

What you'll need to do is tell Requests not to preload the entire response and keep track of how much time you've spent reading so far, while fetching small chunks at a time. You can fetch a chunk using r.raw.read(CHUNK_SIZE). Overall, the code will look something like this:

import requests
import time

CHUNK_SIZE = 2**12  # bytes per read
TIME_EXPIRE = time.time() + 5  # absolute deadline, 5 seconds from now

r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip', prefetch=False)

data = ''
buffer = r.raw.read(CHUNK_SIZE)
while buffer:
    data += buffer
    buffer = r.raw.read(CHUNK_SIZE)

    if TIME_EXPIRE < time.time():
        # Quit after 5 seconds.
        data += buffer
        break

r.raw.release_conn()

print "Read %s bytes out of %s expected." % (len(data), r.headers['content-length'])

Note that this might sometimes use a bit more than the 5 seconds allotted, as the final r.raw.read(...) could lag an arbitrary amount of time. But at least it doesn't depend on multithreading or socket timeouts.

shazow
  • Unfortunately this does not work, because not only the last, but even every r.raw.read(...) could lag an arbitrary amount of time. This can often lead to missing the timeout with downloads from arbitrary urls. – Hristo Hristov Nov 27 '12 at 05:46
  • Then sounds like socket timeout is the only way to go. – shazow Nov 28 '12 at 22:53
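
A sketch of that combination, assuming a later requests version where stream=True replaced prefetch=False:

import requests

# timeout= is a per-read socket timeout here (2 seconds is an arbitrary choice);
# it does not bound the whole download, but it does cap how long any single
# chunk read in a loop like the one above can stall.
r = requests.get('http://ipv4.download.thinkbroadband.com/1GB.zip',
                 stream=True, timeout=2)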

Run the download in a thread, which you can then abort if it has not finished on time.

import requests
import threading

URL = 'http://ipv4.download.thinkbroadband.com/1GB.zip'
TIMEOUT = 0.5  # seconds

def download(return_value):
    return_value.append(requests.get(URL))

return_value = []
download_thread = threading.Thread(target=download, args=(return_value,))
download_thread.start()
download_thread.join(TIMEOUT)

if download_thread.is_alive():
    print 'The download was not finished on time...'
else:
    print return_value[0].headers['content-length']
Piotr Dobrogost
  • This is not a safe road to take. Threading with python is problematic and also I can't just kill the thread on timeout, this is not a clean solution. – Hristo Hristov Nov 26 '12 at 21:22
  • You can replace thread with process if you like. Why can't you kill the thread? – Piotr Dobrogost Nov 26 '12 at 21:25
  • "It is generally a bad pattern to kill a thread abruptly, in python and in any language." http://stackoverflow.com/a/325528/389463 There is no way to tell the thread to stop. – Hristo Hristov Nov 26 '12 at 21:40
  • Using a process is too complicated, it would require inter-process communication. – Hristo Hristov Nov 26 '12 at 21:43
  • With this code, what happens when the timeout triggers? The thread can potentially live forever, nobody stops it. With multiple, slow downloads in parallel this will lead to thread count explosion. – Hristo Hristov Nov 27 '12 at 05:50
  • Yes, but you can switch to the multiprocessing module and there you have `Process.terminate()`, which you can use to terminate the download process. However, if you have multiple downloads then it sounds like you would be better off using an async approach with grequests and timeouts at the gevent level. – Piotr Dobrogost Nov 27 '12 at 08:46
  • Threads actually cannot be stopped in Python. They can be _marked_ as stopped with the `stop` method, but they do in fact continue running in the background. – dotancohen Jul 23 '13 at 09:41
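
A rough sketch of the multiprocessing variant suggested in the comments above; unlike a thread, the worker process really can be terminated, at the cost of an extra Queue for passing the result back (names and the 0.5-second limit mirror the answer, not any fixed API):

import multiprocessing
import requests

URL = 'http://ipv4.download.thinkbroadband.com/1GB.zip'
TIMEOUT = 0.5  # seconds, as in the answer above

def download(queue):
    # Runs in a separate process; report the downloaded size via the queue.
    queue.put(len(requests.get(URL).content))

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=download, args=(queue,))
    proc.start()
    proc.join(TIMEOUT)

    if proc.is_alive():
        proc.terminate()  # a process, unlike a thread, can actually be killed
        proc.join()
        print 'The download was not finished on time...'
    else:
        print queue.get()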