24

I am very new to all of this; I need to obtain data on several thousand SourceForge projects for a paper I am writing. The data is all freely available in JSON format at the URL http://sourceforge.net/api/project/name/[project name]/json. I have a list of several thousand of these URLs and I am using the following code.

import grequests
rs = (grequests.get(u) for u in ulist)
answers = grequests.map(rs)

Using this code I am able to obtain the data for any 200 or so projects I like, i.e. `rs = (grequests.get(u) for u in ulist[0:199])` works, but as soon as I go over that, all attempts are met with

ConnectionError: HTTPConnectionPool(host='sourceforge.net', port=80): Max retries exceeded with url: /api/project/name/p2p-fs/json (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)
<Greenlet at 0x109b790f0: <bound method AsyncRequest.send of <grequests.AsyncRequest object at 0x10999ef50>>(stream=False)> failed with ConnectionError

I am then unable to make any more requests until I quit Python, but as soon as I restart Python I can make another 200 requests.

I've tried using `grequests.map(rs, size=200)`, but this seems to do nothing.

crf
  • I bet sourceforge has an API request limit. They probably only allow 200 requests per 10 seconds per IP address or so – ajon Feb 24 '14 at 03:00
  • @ajon the problem seems to persist for exactly as long as I have one Python session running. I just tried waiting two minutes between sending two 200-size chunks and got error messages on the second one. But I can send the requests without getting errors pretty much immediately as long as I quit Python in between. Does that make any sense to you? – crf Feb 24 '14 at 03:14
  • Yeah, once you get past the limit, your connection is stopped by SourceForge. So restarting the session will work, but you can't keep doing this. – aIKid Feb 24 '14 at 03:15
  • @aIKid I see. Is there a way to restart the connection without quitting Python? – crf Feb 24 '14 at 03:19
  • just rate-limit your requests by adding a `time.sleep` in your loop. – roippi Feb 24 '14 at 03:24
  • @roippi the rate doesn't seem to matter, though. I tried that with 4 seconds and it failed, but as I said to ajon, it doesn't seem to care how long I wait between requests. It stops me after 200ish no matter what and won't let me make any more until I restart Python. – crf Feb 24 '14 at 03:26

2 Answers

28

In my case, it was not rate limiting by the destination server, but something much simpler: I didn't explicitly close the responses, so they kept their sockets open and the Python process ran out of file handles.

My solution (I don't know for sure which of the two fixed the issue; theoretically either should) was to:

  • Set `stream=False` in `grequests.get`:

     rs = (grequests.get(u, stream=False) for u in urls)
    
  • Explicitly call `response.close()` after reading `response.content`:

     responses = grequests.map(rs)
     for response in responses:
         make_use_of(response.content)
         response.close()
    

Note: simply destroying the response object (assigning `None` to it, calling `gc.collect()`) was not enough - this did not close the file handles.
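The close-after-use pattern can also be written with `contextlib.closing`, which guarantees `close()` runs even if processing a response raises. A stdlib-only sketch of the idea (`FakeResponse` here is just a stand-in I made up for a grequests response, so the snippet runs without the network):

```python
import contextlib

class FakeResponse:
    """Stand-in for a grequests/requests response; records whether close() was called."""
    def __init__(self, content):
        self.content = content
        self.closed = False

    def close(self):
        self.closed = True

responses = [FakeResponse(b"a"), FakeResponse(b"b")]
contents = []
for response in responses:
    with contextlib.closing(response):  # close() runs even if the body below raises
        contents.append(response.content)

print(all(r.closed for r in responses))  # → True
```

With real grequests responses the loop body would be the same, just with `grequests.map(rs)` producing the list.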

Virgil
  • Did you have to make chunks of requests for this to work? It would be ideal if there was a way to pass a significantly larger list of URLs into grequests and have it automatically close the responses, but that doesn't seem like an option based on the GitHub code and documentation. – neelshiv Mar 09 '16 at 18:54
2

This can easily be changed to use whatever number of connections you want:

import time
import grequests

MAX_CONNECTIONS = 100  # number of connections you want to limit it to
# urlsList: your list of URLs

results = []
for x in range(0, len(urlsList), MAX_CONNECTIONS):
    rs = (grequests.get(u, stream=False) for u in urlsList[x:x + MAX_CONNECTIONS])
    time.sleep(0.2)  # you can change this to whatever you see works better
    results.extend(grequests.map(rs))  # the key here is to extend, not append, not insert
    print("Waiting")  # optional, so you see something is being done
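The slicing above can be factored into a small helper so the batching logic is reusable; this is a stdlib-only sketch (the name `chunked` is mine, not part of grequests), and each batch it yields would then be passed to `grequests.map`:

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items, covering the whole list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Example: 7 URLs in batches of at most 3 → batch sizes 3, 3, 1.
batches = list(chunked(["u%d" % i for i in range(7)], 3))
print([len(b) for b in batches])  # → [3, 3, 1]
```

Starting the range at 0 (not 1) matters: the original loop's `range(1, pages+1, MAX_CONNECTIONS)` would silently skip the first URL in the list.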