
I am looking to parse data from a large number of webpages (>10k) using Python, and I am finding that the function I have written to do this encounters a timeout error roughly every 500 loops. I have attempted to handle this with a try/except block, but I would like to improve the function so that it re-attempts to open the URL four or five times before raising the error. Is there an elegant way to do this?

My code is below:

def url_open(url):
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup

    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    try:
        s = urlopen(req, timeout=50).read()
    except HTTPError as e:
        if e.code == 404:
            print(str(e))
        else:
            print(str(e))
            # retry once, then re-raise the original error
            s = urlopen(req, timeout=50).read()
            raise
    return BeautifulSoup(s, "lxml")
user3725021
  • Possible duplicate of [How to retry urllib2.request when fails?](http://stackoverflow.com/questions/9446387/how-to-retry-urllib2-request-when-fails) – phss Jan 15 '17 at 08:52

1 Answer


I've used a pattern like this for retrying in the past:

def url_open(url):
    import time
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError
    from bs4 import BeautifulSoup

    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    retrycount = 0
    s = None
    while s is None:
        try:
            s = urlopen(req, timeout=50).read()
        except HTTPError as e:
            print(str(e))
            if canRetry(e.code):
                retrycount += 1
                if retrycount > 5:
                    raise
                time.sleep(1)  # back off for a bit before the next attempt
            else:
                raise

    return BeautifulSoup(s, "lxml")

You just have to define canRetry somewhere else.
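For illustration, a minimal sketch of what canRetry might look like is below; the particular set of retryable status codes is my assumption, not something defined in the answer:

def canRetry(code):
    # Hypothetical helper: treat request timeouts and transient server-side
    # errors as retryable; anything else (e.g. 404) is treated as fatal.
    return code in (408, 429, 500, 502, 503, 504)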

GantTheWanderer