
I have a problem: my code for downloading files from URLs using requests stalls for no apparent reason. When I start the script, it downloads several hundred files, but then it just stops somewhere. If I try the URL manually in the browser, the image loads without a problem. I also tried urllib.urlretrieve, but had the same problem. I use Python 2.7.5 on OS X.

Below you'll find

  • the code I use,
  • the dtruss output captured while the program is stalling, and
  • the traceback that is printed when I Ctrl-C the process after nothing has happened for 10 minutes.

Code:

import requests

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = requests.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results

dtruss output:

My-desk:~ Me$ sudo dtruss -p 708
SYSCALL(args) = return

Traceback:

318 http://farm1.static.flickr.com/32/47394454_10e6d7fd6d.jpg
Traceback (most recent call last):
  File "slow_download.py", line 71, in <module>
    if final_path == '':
  File "slow_download.py", line 34, in download_photos_from_urls
    download_path = concept+'/'+url.split('/')[-1]
  File "slow_download.py", line 21, in download_from_url
    with open(download_path, 'wb') as handle:
  File "/Library/Python/2.7/site-packages/requests/models.py", line 638, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 256, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 186, in read
    data = self._fp.read(amt)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 567, in read
    s = self.fp.read(amt)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
KeyboardInterrupt
Framester
  • It's possible the server might be blocking you... – MattDMo Sep 30 '14 at 16:15
  • Try setting a low time limit and limit the concurrency (see the timeout sketch after these comments). I think you are running into resource limits (max open files, for example), but it's hard to say. – Wolph Sep 30 '14 at 16:18
  • @Wolph, thanks for the idea. How can I find out if this is the case? – Framester Sep 30 '14 at 16:21
  • @MattDMo, thanks for the idea, but wouldn't that mean I also couldn't access the file in the browser anymore? – Framester Sep 30 '14 at 16:22
  • @Framester not necessarily. Unless you've changed the headers to exactly mimic your web browser, the server can easily pick up on the fact that it's a robot downloading instead of a human browsing. – MattDMo Sep 30 '14 at 16:23
  • Check out [this question](http://stackoverflow.com/questions/10606133/how-to-send-user-agent-in-requests-library-in-python) and [these docs](http://docs.python-requests.org/en/latest/api/#requests.request) for info on setting the `User-Agent` header. http://whatsmyuseragent.com/ will show you exactly what your current user agent is. – MattDMo Sep 30 '14 at 16:28
  • I met a similar bug; basically it feels like urllib3 does not correctly handle a closed socket and keeps busylooping despite being fed 0 bytes over and over again... – Antti Haapala -- Слава Україні Dec 29 '14 at 14:36
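
To make the "low time limit" suggestion concrete: by default requests sets no socket timeout, so a stalled read blocks forever. Passing timeout= makes the stall raise an exception you can catch and retry instead. A minimal sketch of the function from the question with a timeout added (the 10-second value is only illustrative):

import requests

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        # timeout covers connecting and each read from the socket, so a
        # stalled download raises requests.exceptions.Timeout instead of hanging
        response = requests.get(url, stream=True, timeout=10)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)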

2 Answers


So, just to unify all the comments and propose a potential solution: there are a couple of reasons why your downloads are failing after a few hundred. It may be internal to Python, such as hitting the maximum number of open file handles, or it may be an issue with the server blocking you for being a robot.

You didn't share all of your code, so it's a bit difficult to say, but at least with what you've shown you're using the with context manager when opening the files to write to, so you shouldn't run into problems there. There's the possibility that the request objects are not getting closed properly after exiting the loop, but I'll show you how to deal with that below.
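
If you want to rule the file-handle limit out, the standard library's resource module will show the current per-process limit (purely a diagnostic sketch; on OS X the default soft limit is quite low):

import resource

# soft and hard limits on open file descriptors for this process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print 'fd limit (soft, hard):', soft, hard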

The default requests User-Agent is (on my machine):

python-requests/2.4.1 CPython/3.4.1 Windows/8

so it's not inconceivable that the server(s) you're requesting from are screening for UAs like this and limiting the number of connections they allow. The reason you were also able to get the code to work with urllib.urlretrieve for a while is that its UA is different from requests', so the server allowed it to continue for approximately the same number of requests before shutting it down, too.
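
If you want to confirm what your own install sends, requests exposes its defaults (a quick check, not part of the fix itself):

import requests

# prints the default headers, including the python-requests User-Agent string
print requests.utils.default_headers()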

To get around these issues, I suggest altering your download_from_url() function to something like this:

import requests
from time import sleep

def download_from_url(url, download_path, delay=5):
    headers = {'Accept-Encoding': 'identity, deflate, compress, gzip', 
               'Accept': '*/*',
               'Connection': 'keep-alive',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0'}
    with open(download_path, 'wb') as handle:
        response = requests.get(url, headers=headers) # no stream=True, that could be an issue
        handle.write(response.content)
        response.close()
        sleep(delay)

Instead of using stream=True, we use the default value of False to immediately download the full content of the request. The headers dict contains a few default values, as well as the all-important 'User-Agent' value, which in this example happens to be my UA, determined by visiting whatsmyuseragent.com. Feel free to change this to the one reported for your preferred browser.

Instead of messing around with iterating through the content in 1 KB blocks, here I just write the entire content to disk at once, eliminating extraneous code and some potential sources of error - for example, if there were a hiccup in your network connectivity, you could temporarily get empty blocks and break out of the loop prematurely. I also explicitly close the request, just in case.

Finally, I added an extra parameter to your function, delay, to make the function sleep for a certain number of seconds before returning. I gave it a default value of 5; you can make it whatever you want (it also accepts floats for fractional seconds).
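
For reference, the loop from your question would then call it with whatever pause you like; the half-second here is just an example:

for i, url in enumerate(urls):
    download_path = concept + '/' + url.split('/')[-1]
    download_from_url(url, download_path, delay=0.5)  # floats work for sub-second delays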

I don't happen to have a large list of image URLs lying around to test this, but it should work as expected. Good luck!

MattDMo

Perhaps the lack of connection pooling is causing too many open connections. Try something like this (using a session, so connections to the same host are reused):

import requests

session = requests.Session()

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = session.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results
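
If the stalls persist even with a shared session, you can also bound the pool size and add automatic retries by mounting an HTTPAdapter on the session; a sketch with illustrative numbers (combine it with a per-request timeout=, as suggested in the comments):

import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# reuse a small pool of connections and retry failed connection attempts a few times
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=3)
session.mount('http://', adapter)
session.mount('https://', adapter)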
Wolph