
I am trying to modify the solution shown here: What is the fastest way to send 100,000 HTTP requests in Python?, except that instead of checking the header status I am making an API request that returns a dictionary, and I would like the end result of all of these API requests to be a list of all of the dictionaries.

Here is my code -- assume that `api_calls` is a list containing each URL to open for the JSON requests...

import json
import sys
from threading import Thread
from Queue import Queue
from urllib2 import urlopen  # Python 2, to match the Queue import above

concurrent = 200

def doWork():
    while True:
        url = q.get()
        result = makeRequest(url[0])
        doSomethingWithResult(result, url)
        q.task_done()

def makeRequest(ourl):
    try:
        api_call = urlopen(ourl).read()
        result = json.loads(api_call)
        return result, ourl
    except:
        return "error", ourl

def doSomethingWithResult(result, url):
    print(url, result)

# bounded queue of pending urls, drained by the worker threads
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()
try:
    for url in api_calls:
        q.put(url)
    q.join()
except KeyboardInterrupt:
    sys.exit(1)

Like the linked example, this currently successfully prints the url, result on each line. What I would instead like to do is add the (url, result) to a list in each thread and then join them into one master list at the end. I cannot figure out how to maintain this master list and join the results at the end. Can anybody help with what I should modify in `doSomethingWithResult`? If I were doing one large loop, I would just have an empty list and append the result to it after each API request, but I do not know how to mimic this now that I am using threads.
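For reference, this is the single-loop version I have in mind (a quick sketch reusing the `makeRequest` above), where keeping a master list is trivial:

# What I would do without threads: one list, appended to after every request.
results = []
for url in api_calls:
    result = makeRequest(url[0])   # same makeRequest as above
    results.append((url, result))
# results is the master list I am after
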

I expect that a common response will be to use asynchronous I/O (https://en.wikipedia.org/wiki/Asynchronous_I/O), and if that is the suggestion, then I would appreciate somebody actually providing an example that accomplishes as much as the code I have linked above.

1 Answer


Use a `ThreadPool` instead. It does the heavy lifting for you. Here is a working example that fetches a few URLs.

import json
import multiprocessing.pool
from urllib2 import urlopen  # Python 2

concurrent = 200

def makeRequest(ourl):
    try:
        api_call = urlopen(ourl).read()
        result = json.loads(api_call)
        return "success", ourl
    except:
        return "error", ourl

def main():
    api_calls = [
        'http://jsonplaceholder.typicode.com/posts/{}'.format(i)
        for i in range(1, 5)]

    # a thread pool that implements the process pool API.
    pool = multiprocessing.pool.ThreadPool(processes=concurrent)
    return_list = pool.map(makeRequest, api_calls, chunksize=1)
    pool.close()
    for status, data in return_list:
        print(data)

main()
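
To get the master list of dictionaries the question asks for, the only change needed (per the comment exchange below) is to return the parsed result from `makeRequest` instead of the "success" marker; `pool.map` then collects one return value per URL into a single list. A minimal sketch of that variant (the `collect` helper name is just illustrative):

import json
import multiprocessing.pool
from urllib2 import urlopen  # Python 2

def makeRequest(ourl):
    try:
        # return the parsed JSON dict itself rather than a "success" marker
        return json.loads(urlopen(ourl).read()), ourl
    except:
        return "error", ourl

def collect(api_calls, concurrent=200):
    pool = multiprocessing.pool.ThreadPool(processes=concurrent)
    try:
        # map() preserves input order and returns one (result, url) tuple per
        # call -- this list is the "master list" from the question.
        return pool.map(makeRequest, api_calls, chunksize=1)
    finally:
        pool.close()
        pool.join()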
  • This certainly looks simpler. However, this does not seem to be working for me. I got an error "AttributeError: 'StdIn' object has no attribute 'close'" so there seems to be some issue? Any idea what is going wrong here? – reese0106 Apr 06 '16 at 23:35
  • I don't see how this example can cause that error. Do you have a couple of sample urls I can test with? Otherwise, if you have updated code that does fail, add that to your original question. – tdelaney Apr 06 '16 at 23:58
  • I updated the example to wrap the pool in a function. This should make no difference for a thread pool but is how you should implement a process pool on windows. I also added some urls so you can just copy/paste to test. – tdelaney Apr 07 '16 at 00:07
  • Maybe there is a problem... I was returning the `Response` object, not a simple status. See the change. – tdelaney Apr 07 '16 at 00:10
  • @tdelaney Okay, this does seem to be working now! I was able to switch "success" to return the results object and it seems to work really well. From a performance perspective, if I am trying to make 100k+ API calls - would there be anything you would do differently to speed things up? – reese0106 Apr 07 '16 at 00:54
  • @tdelaney this works very fast when I supply 100 urls, but when I tried to scale to 1000, it really slowed down (way more than 10x) and I am hoping to scale this to the order of 100k+ requests. Can you think of any reason that this would not scale well beyond 100 requests? – reese0106 Apr 07 '16 at 01:30
  • I'm not sure why, but some guesses... Python threads can thrash; you could do a two-tiered setup with a process pool sized to the number of CPUs on your system, where each process worker runs a thread pool of, say, 32 threads. For the pools, use `imap_unordered` instead of `map`, and instead of holding a list in memory, write to files (one file per process worker) and read them later (a rough sketch of this layout follows these comments). – tdelaney Apr 07 '16 at 03:33
  • You may have maxed your network bandwidth. If you send more than your pipe to the internet can handle, packets are dropped and TCP retries come into play. TCP fallbacks are pretty good at adjusting to network bandwidth at the expense of significant delays. The site you are hitting may be limiting how much it will send back. – tdelaney Apr 07 '16 at 03:35
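
For concreteness, here is a rough sketch of the two-tier layout suggested in the comment above: a process pool sized to the number of CPUs, each process running its own thread pool with `imap_unordered` and streaming its results to its own file. The function names, the 32-thread figure, and the results_<index>.jsonl file naming are illustrative choices, not anything prescribed in the answer.

import json
import multiprocessing
import multiprocessing.pool
from urllib2 import urlopen  # Python 2

THREADS_PER_PROCESS = 32  # the "say, 32 threads" from the comment; tune as needed

def makeRequest(ourl):
    try:
        return json.loads(urlopen(ourl).read()), ourl
    except:
        return "error", ourl

def process_worker(args):
    # Each process runs its own thread pool and streams results to its own
    # file instead of holding one huge list in memory.
    index, url_chunk = args
    pool = multiprocessing.pool.ThreadPool(THREADS_PER_PROCESS)
    out_path = 'results_{}.jsonl'.format(index)
    with open(out_path, 'w') as out:
        for result, url in pool.imap_unordered(makeRequest, url_chunk):
            out.write(json.dumps({'url': url, 'result': result}) + '\n')
    pool.close()
    pool.join()
    return out_path

def run(api_calls):
    n_procs = multiprocessing.cpu_count()
    # one (index, chunk) pair per process worker
    chunks = [(i, api_calls[i::n_procs]) for i in range(n_procs)]
    proc_pool = multiprocessing.Pool(n_procs)
    paths = list(proc_pool.imap_unordered(process_worker, chunks))
    proc_pool.close()
    proc_pool.join()
    return paths  # read these files afterwards to rebuild the master list

On Windows, run() would need to be called under an if __name__ == '__main__': guard, as the answer's earlier edit notes for process pools.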