
I have a simple function (in Python 3) that takes a URL and attempts to resolve it: it prints an error code if there is one (e.g. 404), or resolves a shortened URL to its full URL. My URLs are in one column of a CSV file and the output is saved in the next column. The problem arises when the program encounters a URL where the server takes too long to respond: the program just crashes. Is there a simple way to force urllib to print an error code if the server is taking too long? I looked into Timeout on a function call, but that looks a little too complicated as I am just starting out. Any suggestions?

i.e. (COL A) shorturl (COL B) http://deals.ebay.com/500276625

import urllib.error
import urllib.request

def urlparse(urlColumnElem):
    try:
        conn = urllib.request.urlopen(urlColumnElem)
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError:
        return 'URL_Error'
    else:
        redirect = conn.geturl()
        # check redirect (both branches currently return the resolved URL)
        if redirect == urlColumnElem:
            #print("same: ", redirect)
            return redirect
        else:
            #print("Not the same url")
            return redirect
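For reference, the surrounding CSV loop looks roughly like this (a sketch; the file names and the `resolver` parameter are placeholders, not part of the original code):

```python
import csv

def process_csv(in_path, out_path, resolver):
    """Read short URLs from column A, write the resolver's result to column B.

    `resolver` is any callable like urlparse() above: it takes a URL string
    and returns either the resolved URL or an error label.
    """
    with open(in_path, newline='') as fin, \
         open(out_path, 'w', newline='') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for row in reader:
            url = row[0]
            writer.writerow([url, resolver(url)])

# Usage (illustrative): process_csv('urls.csv', 'resolved.csv', urlparse)
```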

EDIT: if anyone gets the http.client.RemoteDisconnected error (like me), see this question/answer: http.client.RemoteDisconnected error while reading/parsing a list of URLs
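A sketch of how that error could be caught alongside the others (this is an assumption in the same style as the function above, not the linked answer's exact code; the `opener` hook is only an illustrative seam for testing, not part of urllib):

```python
import http.client
import socket
import urllib.error
import urllib.request

def safe_resolve(url, opener=urllib.request.urlopen):
    try:
        conn = opener(url, timeout=5)
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError:
        return 'URL_Error'
    except http.client.RemoteDisconnected:
        # the server closed the connection without sending a response
        return 'Remote_Disconnected'
    except socket.timeout:
        return 'Connection timeout'
    return conn.geturl()
```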

Thomas E

2 Answers


Have a look at the docs:

urllib.request.urlopen(url, data=None[, timeout])

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used).

You can set a realistic timeout (in seconds) for your process:

conn = urllib.request.urlopen(urlColumnElem, timeout=realistic_timeout_in_seconds)

and, in order for your code to stop crashing, move everything inside the try/except block:

import socket

def urlparse(urlColumnElem):
    try:
        conn = urllib.request.urlopen(
                   urlColumnElem, 
                   timeout=realistic_timeout_in_seconds
               )
        redirect=conn.geturl()
        #check redirect
        if(redirect == urlColumnElem):
            #print ("same: ")
            #print(redirect)
            return (redirect)
        else:
            #print("Not the same url ")
            return(redirect)

    except urllib.error.HTTPError as e:
        return (e.code)
    except urllib.error.URLError as e:
        return ('URL_Error')
    except socket.timeout as e:
        return ('Connection timeout')

Now if a timeout occurs, you will catch the exception and the program will not crash.
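Alternatively, since the docs quoted above say urlopen falls back to the global default timeout when none is passed, you can set that default once instead of threading a `timeout` argument through every call (a sketch; the 5 seconds is an arbitrary choice):

```python
import socket

# Every subsequent socket operation without an explicit timeout,
# including urllib.request.urlopen(url), now gives up after 5 seconds
# and raises socket.timeout instead of blocking indefinitely.
socket.setdefaulttimeout(5)
```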

Good luck :)

John Moutafis
  • That partially works, however I just get a timeout error and the program quits instead of waiting much longer. 142 Traceback (most recent call last): ... socket.timeout: timed out – Thomas E Apr 27 '17 at 10:49
  • I have updated my answer, the timeout raises a `socket.timeout` exception – John Moutafis Apr 27 '17 at 10:52
  • Yep, I got there too by combining the answer from the bottom. For any beginner, you also need the `import socket` line for it to work. Thanks! – Thomas E Apr 27 '17 at 10:53
  • No problem mate :), I will update the answer to the beginner-friendly suggestion you made! – John Moutafis Apr 27 '17 at 10:56
  • Thanks! I am from the world of C, where writing something like this would take me a week without a third-party library! – Thomas E Apr 27 '17 at 11:00
  • Good luck in your python adventures :) – John Moutafis Apr 27 '17 at 11:02
  • Nope, still doesn't work! It now throws an http.client.RemoteDisconnected error. Do I just add that as an exception? I am trying to dive into the deep end, and don't quite understand how the more advanced functions work yet – Thomas E Apr 27 '17 at 11:09
  • That is a different issue, you should open a new question on SO if you don't find a good solution by yourself... Have a starting point here: https://docs.python.org/3/library/http.client.html#http.client.RemoteDisconnected – John Moutafis Apr 27 '17 at 11:21

First, there is a timeout parameter that can be used to control the time allowed for urlopen. Next, a timeout in urlopen should just throw an exception, more precisely a socket.timeout. If you do not want it to abort the program, you just have to catch it (remember to import socket):

import socket
import urllib.error
import urllib.request

def urlparse(urlColumnElem, timeout=5):   # allow 5 seconds by default
    try:
        conn = urllib.request.urlopen(urlColumnElem, timeout=timeout)
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError:
        return 'URL_Error'
    except socket.timeout:
        return 'Timeout'
    else:
        ...
Serge Ballesta