
I can't seem to get the urllib2 timeout to be taken into account. I have read, I believe, all the posts related to this topic, and it seems I'm not doing anything wrong. Am I correct? Many thanks for your kind help.

Scenario:

I need to check for Internet connectivity before continuing with the rest of a script, so I wrote a function (Net_Access), which is provided below.

  • When I execute this code with my LAN or Wi-Fi interface connected and check an existing hostname, all is fine: there is no error or problem, and thus no timeout.
  • If I unplug my LAN connector or check against a non-existent hostname, the timeout value seems to be ignored. What's wrong with my code, please?

Some info:

  • Ubuntu 10.04.4 LTS (running in a VirtualBox v4.2.6 VM; the host OS is Mac OS X Lion)
  • cat /proc/sys/kernel/osrelease: 2.6.32-42-generic
  • Python 2.6.5

My code:

#!/usr/bin/env python

import socket
import urllib2

myhost = 'http://www.google.com'
timeout = 3

# Set the timeout both as the socket default and via urlopen's own parameter.
socket.setdefaulttimeout(timeout)
req = urllib2.Request(myhost)

try:
    handle = urllib2.urlopen(req, timeout = timeout)
except urllib2.URLError as e:
    socket.setdefaulttimeout(None)
    print ('[--- Net_Access() --- No network access')
else:
    print ('[--- Net_Access() --- Internet Access OK')

1) Working, with LAN connector plugged in

$ time ./Net_Access
[--- Net_Access() --- Internet Access OK

real    0m0.223s
user    0m0.060s
sys 0m0.032s

2) Timeout not working, with LAN connector unplugged

$ time ./Net_Access 
[--- Net_Access() --- No network access

real    1m20.235s
user    0m0.048s
sys 0m0.060s

Added to original post: test results (using IP instead of FQDN)

As suggested by @unutbu (see comments), replacing the FQDN in myhost with an IP address fixes the problem: the timeout takes effect.

LAN connector plugged in...
$ time ./Net_Access
[--- Net_Access() --- Internet Access OK

real    0m0.289s
user    0m0.036s
sys 0m0.040s

LAN connector unplugged...
$ time ./Net_Access
[--- Net_Access() --- No network access

real    0m3.082s
user    0m0.052s
sys 0m0.024s

This is nice, but it means that the timeout can only be used with an IP address and not with an FQDN. Weird...

Has anyone found a way to use the urllib2 timeout without resorting to pre-resolving DNS and passing an IP address to the function? Or do you first use a socket to test the connection and only fire urllib2 once you are sure you can reach the target? (A rough sketch of that second idea follows.)
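For example, the second approach might look roughly like this (a sketch only; the IP address, port, and helper name are illustrative, not part of my script):

import socket
import urllib2

def can_reach(ip='8.8.8.8', port=53, timeout=3):
    # Connect to a fixed IP address: no DNS lookup is involved,
    # so the timeout is actually honoured.
    try:
        socket.create_connection((ip, port), timeout).close()
        return True
    except socket.error:
        return False

if can_reach():
    handle = urllib2.urlopen('http://www.google.com', timeout=3)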

Many thanks.

  • What happens if you replace `myhost = 'http://www.google.com'` with `myhost = 'http://74.125.228.98'` ? – unutbu Jan 02 '13 at 18:49
  • Thanks. I will try tomorrow morning (it's almost 8pm here and my sweetie is asking me to get down for dinner) :-) – user1943566 Jan 02 '13 at 18:50
  • Why are you using HTTP just to check network access? Being able to fetch a URL is a couple of levels higher than having network access. Besides the fact that this could fail even when you have network access (e.g., your government or company blocks access to Google, or you just have no DNS), the real problem is that there are many more ways that it could timeout or otherwise fail, which makes it harder to get the test right, and harder to interpret the test. Connecting a `socket` to, say, `('8.8.8.8', 53)` or `('74.125.228.98', 80)` is much simpler. – abarnert Jan 02 '13 at 19:57
  • As a second question: Why are you testing this at all? Why not just make your connections, and handle errors as they come up? After all, you still have to handle these errors in case you, say, lose your network connection half-way through the script. And what benefit is there from seeing that the test failed instead of the first request? – abarnert Jan 02 '13 at 19:58
  • To @unutbu: I did try with fixed IP address instead of FQDN and indeed timeout is working. The issue is then with the DNS resolver process. I will amend my original post accordingly. Tks. – user1943566 Jan 03 '13 at 08:25
  • @abarnert Regarding your 1st question: I'm using HTTP because within the same function I can test both network access AND that a given host is alive. For the 2nd question: this is almost answered by my reply to your 1st question, I presume, but actually I posted here to have an answer to the timeout issue I was facing, not about the legitimacy/usefulness of my code (this is just a POF for the timeout issue I am facing). Anyway, your questions were legit. Thanks for these. – user1943566 Jan 03 '13 at 08:30
  • @abarnert Another remark: connecting directly with a socket to port udp/53 or even port tcp/80 might also be a problem. Think about corporate policies re. public DNS access and Web browsing via proxy... – user1943566 Jan 03 '13 at 10:06
  • @user1943566: If you have no public internet access, but do have a web proxy that the default `urllib2.ProxyHandler` can detect and use, `urllib2` will succeed despite not having internet access. So it doesn't do what you asked for. That may be what you actually _want_, but in that case, your question (and code) is misleading. (Also, do you really need to test that www.google.com is alive? If my code told me Google was down, I'd interpret that as most likely a bug in my code…) – abarnert Jan 03 '13 at 18:57

2 Answers


If your problem is with DNS lookup taking forever (or just way too long) to time out when there's no network connectivity, then yes, this is a known problem, and there's nothing you can do within urllib2 itself to fix that.

So, is all hope lost? Well, not necessarily.

First, let's look at what's going on. Ultimately, urlopen relies on getaddrinfo, which (along with its relatives like gethostbyname) is notoriously the one critical piece of the socket API that can't be run asynchronously or interrupted (and on some platforms, it's not even thread-safe). If you want to trace through the source yourself, urllib2 defers to httplib for creating connections, which calls create_connection on socket, which calls socket_getaddrinfo on _socket, which ultimately calls the real getaddrinfo function. This is an infamous problem that affects every network client or server written in every language in the world, and there's no good, easy solution.

One option is to use a different higher-level library that's already solved this problem. I believe requests relies on urllib3 which ultimately has the same problem, but pycurl relies on libcurl, which, if built with c-ares, does name lookup asynchronously, and therefore can time it out.
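For instance, here is a hedged sketch with pycurl (this assumes your libcurl was built with c-ares; without it the DNS lookup may still block past CONNECTTIMEOUT):

import pycurl

def check_url(url, timeout=3):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.NOBODY, True)             # HEAD-style request, skip the body
    c.setopt(pycurl.CONNECTTIMEOUT, timeout)  # covers name resolution + TCP connect
    c.setopt(pycurl.TIMEOUT, timeout)         # overall cap on the whole transfer
    try:
        c.perform()
        return True
    except pycurl.error:
        return False
    finally:
        c.close()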

Or, of course, you can use something like twisted or tornado or some other async networking library. But obviously rewriting all of your code to use a twisted HTTP client instead of urllib2 is not exactly trivial.

Another option is to "fix" urllib2 by monkeypatching the standard library. If you want to do this, there are two steps.

First, you have to provide a timeoutable getaddrinfo. You could do this by binding c-ares, or using ctypes to access platform-specific APIs like linux's getaddrinfo_a, or even looking up the nameservers and communicating with them directly. But the really simple way to do it is to use threading. If you're doing lots of these, you'll want to use a single thread or small threadpool, but for small-scale use, just spin off a thread for each call. A really quick-and-dirty (read: bad) implementation is:

import socket
import threading

GETADDRINFO_TIMEOUT = 3  # seconds to wait for the lookup

def getaddrinfo_async(*args):
    result = []  # mutable holder, since a lambda cannot rebind a local variable
    t = threading.Thread(target=lambda: result.append(socket.getaddrinfo(*args)))
    t.daemon = True  # don't let a stuck lookup keep the process alive
    t.start()
    t.join(GETADDRINFO_TIMEOUT)
    if t.isAlive():
        raise socket.timeout('getaddrinfo timed out')
    return result[0]
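For example (hypothetical usage of the sketch above):

# Resolve www.google.com:80, giving up after GETADDRINFO_TIMEOUT seconds.
addrs = getaddrinfo_async('www.google.com', 80)
print(addrs[0][4])  # sockaddr of the first result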

Next, you have to get all the libraries you care about to use this. Depending on how ubiquitous (and dangerous) you want your patch to be, you can replace socket.getaddrinfo itself, or just socket.create_connection, or just the code in httplib or even urllib2.
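For instance, here is a minimal sketch of the most ubiquitous (and most dangerous) variant, patching socket.getaddrinfo itself. The 3-second budget is an assumption, and the patched function must call a saved reference to the original, or it would recurse into itself:

import socket
import threading

_real_getaddrinfo = socket.getaddrinfo  # keep a handle on the original

def _getaddrinfo_with_timeout(*args):
    # Run the real lookup on a worker thread and give up after 3 seconds.
    result = []
    def worker():
        try:
            result.append(_real_getaddrinfo(*args))
        except Exception as e:  # hand any lookup error back to the caller
            result.append(e)
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()
    t.join(3)
    if not result:
        raise socket.timeout('DNS lookup timed out')
    if isinstance(result[0], Exception):
        raise result[0]
    return result[0]

socket.getaddrinfo = _getaddrinfo_with_timeout  # httplib/urllib2 now pick this up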

A final option is to fix this at a higher level. If your networking stuff is happening on a background thread, you can throw a higher-level timeout on the whole thing, and if it takes more than timeout seconds to figure out whether it has timed out or not, you know it has.
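A rough sketch of that last idea (the function name and 3-second budget are hypothetical, not part of the original script):

import threading
import urllib2

def net_access_with_hard_timeout(url='http://www.google.com', timeout=3):
    # Run the entire urlopen (DNS lookup included) on a background thread
    # and stop waiting after `timeout` seconds, whatever the thread is doing.
    outcome = []
    def worker():
        try:
            urllib2.urlopen(url, timeout=timeout)
            outcome.append(True)
        except urllib2.URLError:
            outcome.append(False)
    t = threading.Thread(target=worker)
    t.daemon = True  # an abandoned thread must not block interpreter exit
    t.start()
    t.join(timeout)
    return bool(outcome) and outcome[0]  # False if it timed out or failed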

abarnert

Perhaps try this:

import urllib2

def get_header(url):
    req = urllib2.Request(url)
    req.get_method = lambda : 'HEAD'
    try:
        response = urllib2.urlopen(req)
    except urllib2.URLError:
        # urllib2.URLError: <urlopen error [Errno -2] Name or service not known>
        return False
    return True

url = 'http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.7.1.tar.bz2'
print(get_header(url))

When I unplug my network adapter, this prints False almost immediately, while under normal conditions this prints True.

I'm not sure why this works so quickly compared to your original code (even without needing to set the timeout parameter), but perhaps it will work for you too.


I did an experiment this morning which resulted in get_header not returning immediately. I booted the computer with the router off. Then the router was turned on. Then networking and wireless were enabled through the Ubuntu GUI. This failed to establish a working connection. At this stage, get_header failed to return immediately.

So, here is a heavier-weight solution which calls get_header in a subprocess using multiprocessing.Pool. The object returned by pool.apply_async has a get method with a timeout parameter. If a result is not returned from get_header within the duration specified by timeout, the subprocess is terminated.

Thus, check_http should return a result within about 1 second, under all circumstances.

import multiprocessing as mp
import urllib2

def timeout_function(cmd, timeout = None, args = (), kwds = {}):
    # Run `cmd` in a single-worker pool so a hung call can be killed from outside.
    pool = mp.Pool(processes = 1)
    result = pool.apply_async(cmd, args = args, kwds = kwds)
    try:
        retval = result.get(timeout = timeout)
    except mp.TimeoutError:
        # The worker is still stuck (e.g. in a DNS lookup): kill the subprocess.
        pool.terminate()
        pool.join()
        raise
    else:
        return retval

def get_header(url):
    req = urllib2.Request(url)
    req.get_method = lambda : 'HEAD'
    try:
        response = urllib2.urlopen(req)
    except urllib2.URLError:
        return False
    return True

def check_http(url):
    try:
        response = timeout_function(
            get_header,
            args = (url, ),
            timeout = 1)
        return response
    except mp.TimeoutError:
        return False

print(check_http('http://www.google.com'))
unutbu