I am looking to parse data from a large number of webpages (more than 10,000) using Python, and I am finding that the function I have written often hits a timeout error roughly every 500 iterations. I have attempted to handle this with a try/except block, but I would like to improve the function so that it re-attempts to open the URL four or five times before raising the error. Is there an elegant way to do this? (I have included a rough sketch of the sort of retry loop I am imagining after my code.)
My code below:
    def url_open(url):
        from urllib.request import Request, urlopen
        from urllib.error import HTTPError
        from bs4 import BeautifulSoup

        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        try:
            s = urlopen(req, timeout=50).read()
        except HTTPError as e:
            if e.code == 404:
                print(str(e))
                return None  # give up on missing pages
            else:
                print(str(e))
                # current crude retry: one more attempt, then re-raise regardless
                s = urlopen(req, timeout=50).read()
                raise
        return BeautifulSoup(s, "lxml")
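For illustration, this is roughly the retry pattern I am imagining, written as an untested sketch rather than working code. The function name url_open_with_retries, the max_retries count of 5, the two-second pause between attempts, and the choice to give up immediately on a 404 are all my own guesses, not something I have settled on:

    import socket
    import time
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError, URLError
    from bs4 import BeautifulSoup

    def url_open_with_retries(url, max_retries=5, timeout=50):
        """Fetch and parse a page, retrying a few times before giving up."""
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        last_error = None
        for attempt in range(max_retries):
            try:
                s = urlopen(req, timeout=timeout).read()
                return BeautifulSoup(s, "lxml")
            except HTTPError as e:
                if e.code == 404:
                    raise  # a missing page will not appear on a retry
                last_error = e
            except (URLError, socket.timeout) as e:
                last_error = e  # timeouts and connection problems look transient
            time.sleep(2)  # brief pause before the next attempt
        raise last_error  # all attempts failed; surface the last error

I separated out the 404 case because retrying a page that does not exist seems pointless, whereas timeouts look transient. Is a plain loop like this the usual approach, or is there a cleaner or more idiomatic way to do it?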