
I am trying to get data from different APIs. The data is received in JSON format, stored in SQLite, and parsed afterwards.

The issue I am having is that when sending many requests I eventually receive an error, even though I am using time.sleep between requests.

Usual approach

My code looks like the snippet below, where this would be inside a loop and the URL to be opened would be changing:

import time
import urllib2

base_url = 'https://www.whateversite.com/api/index.php?'
custom_url = 'variable_text1' + '&' + 'variable_text2'

url = base_url + custom_url #url will be changing on each iteration

time.sleep(1)
data = urllib2.urlopen(url).read()

This runs thousands of times inside the loop. The problem comes after the script has been running for a while (up to a couple of hours); then I get the following errors or similar ones:

    data = urllib2.urlopen(url).read()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1222, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known>

or

    uh = urllib.urlopen(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 437, in open_https
    h.endheaders(data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 969, in endheaders
    self._send_output(message_body)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 829, in _send_output
    self.send(msg)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 791, in send
    self.connect()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1172, in connect
    self.timeout, self.source_address)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 8] nodename nor servname provided, or not known

I believe this happens because these modules start throwing errors at some point when used too often in a short period of time.

From what I've read in many different threads about which module is better, I think any of them would work for my needs, and the main criterion for choosing one is how many URLs it can open before failing. In my experience, urllib and urllib2 held up better than requests, as requests crashed sooner.

Assuming that I do not want to increase the waiting time used in time.sleep, these are the solutions I have thought of so far:

Possible solutions?

A

I thought of combining all the different modules (a rough sketch follows the list). That would be:

  • Start, for instance, with requests.
  • After a specific time, or when an error is thrown, switch automatically to urllib2.
  • After a specific time, or when an error is thrown, switch automatically to another module (such as httplib2 or urllib) or back to requests.
  • And so on...
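
A minimal sketch of that rotation idea (the fetcher list, the choice to rotate on any IOError, and the function names are all my own assumptions, not tested against the real API):

import urllib
import urllib2

def fetch_urllib2(url):
    return urllib2.urlopen(url).read()

def fetch_urllib(url):
    return urllib.urlopen(url).read()

# one fetcher per module; a requests-based one could be appended as well
FETCHERS = [fetch_urllib2, fetch_urllib]
_current = [0]  # index of the fetcher currently in use

def fetch_with_rotation(url):
    # try each fetcher at most once per call, rotating on failure
    for _ in range(len(FETCHERS)):
        idx = _current[0]
        try:
            return FETCHERS[idx](url)
        except IOError:  # urllib2.URLError is a subclass of IOError
            _current[0] = (idx + 1) % len(FETCHERS)
    raise IOError('all fetchers failed for %s' % url)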

B

Use a try .. except block to handle the exception, as suggested here.
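
For instance, a small retry wrapper along those lines (the attempt count and pause length are arbitrary choices of mine):

import time
import urllib2

def fetch_with_retries(url, attempts=3, pause=30):
    # retry the same URL a few times before giving up for good
    for attempt in range(attempts):
        try:
            return urllib2.urlopen(url).read()
        except urllib2.URLError:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller decide
            time.sleep(pause)  # pause before retrying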

C

I've also read about sending multiple requests at once or in parallel. I don't know exactly how that works or whether it could actually be useful (see the sketch below).
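
From what I understand, a thread pool would be the simplest way to do this in Python 2.7. A sketch of what it might look like (the pool size and the error handling are assumptions):

from multiprocessing.dummy import Pool  # thread-based Pool, ships with Python 2.7
import urllib2

def fetch(url):
    # return (url, data), or (url, None) so one failure does not stop the run
    try:
        return url, urllib2.urlopen(url).read()
    except urllib2.URLError:
        return url, None

pool = Pool(4)  # 4 worker threads; the right size is a guess
results = pool.map(fetch, list_of_urls)
pool.close()
pool.join()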


However, I am not convinced by any of these solutions.

Can you think of any more elegant and/or efficient solution to deal with this error?

I'm on Python 2.7.

  • Perhaps close the object returned by `urlopen()` or `requests` to close the connection? – zachyee Aug 20 '16 at 18:36
  • What is the url it happens on? – Padraic Cunningham Aug 20 '16 at 19:13
  • @PadraicCunningham, I'm not sure if I understood the question: the url is to an API. It is to get info about travel journeys, so you typically need to give at least origin_id, destination_id and date. That is for example `url = serviceurl + segment1 + '&' + segment2 + '&' + segment3` where the segments will be different. This url returns pure JSON. – J0ANMM Aug 20 '16 at 19:29
  • @zachyee, how can I close the object returned? In that case, following the code I wrote in the explanation, that object would be `data`, correct? – J0ANMM Aug 20 '16 at 19:33
  • The socket will be closed automatically, I mean what is the actual url that you are using, is it a public api or are you scraping an internal api – Padraic Cunningham Aug 20 '16 at 19:34
  • https://www.DeinBus.de/fapi/trips/index.php?departure=XXX&arrival=XXX&departuredatetime=XXX&timerange=XXX The documentation can be found here: https://www.deinbus.de/affiliate/dokumentation.php but, you need to be registered and provide username and password to really get access to the data. – J0ANMM Aug 20 '16 at 19:39
  • @JoanMM You would end up doing something like `response = urllib.urlopen(url)` `data = response.read()` `response.close()` Once it goes out of scope, it is eligible for garbage collection, but doing it explicitly might help since you're attempting to reach a lot of urls in a small amount of time. Even better would be to use a contextmanager (with keyword). More info here: http://stackoverflow.com/questions/1522636/should-i-call-close-after-urllib-urlopen – zachyee Aug 20 '16 at 20:38
  • Thanks @zachyee. Unfortunately, I tried and still got the error (this time it was thrown quite fast). – J0ANMM Aug 20 '16 at 21:17
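
For reference, the explicit-close pattern zachyee describes, written with a context manager (contextlib.closing is in the standard library; the rest is a sketch):

from contextlib import closing
import urllib2

with closing(urllib2.urlopen(url)) as response:
    data = response.read()
# the connection is closed here even if read() raises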

1 Answer


Even though I was not convinced, I ended up implementing the try .. except block, and I'm quite happy with the result:

import time
import urllib2
from datetime import datetime
# conn/cur are the SQLite connection and cursor created earlier in the script

for url in list_of_urls:
    time.sleep(2)
    try:
        response = urllib2.urlopen(url)
        data = response.read()
        time.sleep(0.1)
        response.close() #as suggested by zachyee in the comments

        #code to save data in the SQLite database

    except urllib2.URLError:
        print '***** urllib2.URLError: <urlopen error [Errno 8] nodename nor servname provided, or not known> *****'
        #save the error in SQLite
        timestamp = datetime.now().isoformat()
        cur.execute('''INSERT INTO Errors (error_type, error_ts, url_queried)
        VALUES (?, ?, ?)''', ('urllib2.URLError', timestamp, url))
        conn.commit()
        time.sleep(30) #give it a small break

The script did run until the end.

Out of thousands of requests I got only 8 errors, which were saved in my database together with their related URLs. This way I can try to retrieve those URLs again afterwards, if needed (see the sketch below).
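
To retry them later, something like this could read the failed URLs back (the database file name is a placeholder; the table and column names match the INSERT above):

import sqlite3

conn = sqlite3.connect('data.sqlite')  # placeholder file name
cur = conn.cursor()
cur.execute('SELECT url_queried FROM Errors')
failed_urls = [row[0] for row in cur.fetchall()]
# feed failed_urls back into the same fetch loop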
