I am trying to get data from different APIs. They are received in JSON format, stored in SQLite and afterwards parsed.
The issue I am having is that when sending many requests I eventually receive an error, even though I call time.sleep between requests.
Usual approach
My code looks like the one below, where this would be inside a loop and the url to be opened would be changing:
import time
import urllib2

base_url = 'https://www.whateversite.com/api/index.php?'
custom_url = 'variable_text1' + '&' + 'variable_text2'  # query parameters joined with '&'
url = base_url + custom_url  # url will be changing
time.sleep(1)  # pause between requests
data = urllib2.urlopen(url).read()
This runs thousands of times inside the loop. After the script has been running for a while (up to a couple of hours), I get the following errors or similar ones:
data = urllib2.urlopen(url).read()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1222, in https_open
return self.do_open(httplib.HTTPSConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>
or
uh = urllib.urlopen(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 437, in open_https
h.endheaders(data)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 969, in endheaders
self._send_output(message_body)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 829, in _send_output
self.send(msg)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 791, in send
self.connect()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1172, in connect
self.timeout, self.source_address)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 553, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 8] nodename nor servname provided, or not known
I believe this happens because the modules throw an error at some point if used too often in a short period of time.
From what I've read in many different threads about which module is better, I think any of them would work for my needs, and the main criterion for choosing one is that it can open as many URLs as possible. In my experience, urllib and urllib2 are better than requests for that, as requests crashed in less time.
Assuming that I do not want to increase the waiting time used in time.sleep, these are the solutions I have thought of so far:
Possible solutions?
A
I thought of combining all the different modules. That would be:
- Start, for instance, with requests.
- After a specific time or when an error is thrown, switch automatically to urllib2.
- After a specific time or when an error is thrown, switch automatically to other modules (such as httplib2 or urllib) or back to requests.
- And so on...
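A rough sketch of how the switching in option A might look is below. The name `fetch_with_fallback` and the idea of passing the modules in as a list of callables are assumptions made for illustration; each entry in `fetchers` would wrap one module (requests, urllib2, and so on). Whether switching modules helps at all depends on whether the error comes from a particular module or from something they all share, which this sketch does not settle.

```python
def fetch_with_fallback(fetchers, url):
    """Try each callable in fetchers in order and return the first result.

    fetchers is a hypothetical list of functions that take a URL and
    return the response body, one per module.
    """
    last_error = None
    for fetch in fetchers:
        try:
            return fetch(url)
        except IOError as err:
            last_error = err  # remember the failure and try the next one
    if last_error is None:
        raise ValueError("fetchers must not be empty")
    raise last_error  # every fetcher failed: surface the last error
```

With urllib2 and urllib, for example, the list could be `[lambda u: urllib2.urlopen(u).read(), lambda u: urllib.urlopen(u).read()]`.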
B
Use a try .. except block to handle that exception, as suggested here.
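Option B could look roughly like the sketch below. It assumes the failure is transient (the second traceback shows Errno 8 being raised from the getaddrinfo call), so that waiting and retrying has a chance of succeeding. `fetch_with_retry`, its parameters, and the delay values are hypothetical names and numbers chosen for illustration; `fetch` stands for whatever callable opens the URL.

```python
import time


def fetch_with_retry(fetch, url, retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on IOError wait, then retry with a doubling delay.

    On Python 2.7 urllib2.URLError is a subclass of IOError, and urllib
    raises IOError directly, so one except clause covers both tracebacks.
    """
    delay = base_delay
    for attempt in range(retries):
        try:
            return fetch(url)
        except IOError:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(delay)
            delay *= 2  # back off: 1s, 2s, 4s, ...
```

Inside the existing loop this might be called as `data = fetch_with_retry(lambda u: urllib2.urlopen(u).read(), url)`. The doubling delay only adds waiting time when a request actually fails, so the normal time.sleep interval stays unchanged.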
C
I've also read about sending multiple requests at once or in parallel. I don't know exactly how that works or whether it could actually be useful.
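For reference, option C could be sketched with a thread pool from the standard library, as below. `fetch_all`, `workers`, and the pool size of 4 are assumptions made for illustration, and `fetch` is again any callable that takes a URL and returns the response body. Note that parallel requests shorten the total run time but also raise the number of requests per second, so this may not help if the error is related to request frequency.

```python
from multiprocessing.pool import ThreadPool


def fetch_all(fetch, urls, workers=4):
    """Run fetch(url) for every URL on a small pool of threads.

    Results are returned in the same order as urls.
    """
    pool = ThreadPool(workers)
    try:
        return pool.map(fetch, urls)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker threads to exit
```

A call might look like `pages = fetch_all(lambda u: urllib2.urlopen(u).read(), list_of_urls)`, where `list_of_urls` is a hypothetical list built by the loop beforehand.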
However, I am not convinced about any of these solutions.
Can you think of any more elegant and/or efficient solution to deal with this error?
I'm using Python 2.7.