
I am trying to get data for multiple clients (details in a CSV file) using calls to an internal system API. Currently, averaged over 1000 API calls while running 15 threads, the response time is 0.6691 secs per request. Over the course of half a million requests this will add up to about 93 hours. My code, without too many details of the API, looks like so:

import csv
import time
from threading import Thread, activeCount

import requests


def get_file():
    count = -1
    freader = open(f_name, 'rU')
    csvreader = csv.reader(freader)
    for row in csvreader:
        userid = str(row[0])
        count += 1
        # throttle: wait until fewer than 15 threads are active before starting another
        while activeCount() > 15:
            time.sleep(10)
        thread = Thread(target=check, args=(userid, count,))
        thread.start()
        # check(userid)
    thread.join()  # note: this only waits for the last thread started in the loop


def check(userid, count):
    headers = {
        'Accept': '*/*',
        'Authorization': authStr,
        'accept-encoding': 'gzip, deflate'
    }
    url = "{}{}/{}/{}".format(api_url, site_id, id_type, userid)
    response = requests.get(url, headers=headers)
    if response.status_code == 404:
        viewed_count = 0  # user not found in the system
    else:
        viewed_count = json_extract(response.json())

How can I speed this up? What is the maximum number of threads (`activeCount`) I can specify? And is there an easier, faster and more elegant way to do this?

  • Where/when is `get_file()` called? When it joins the `thread`, whatever thread it is in will no longer be able to execute, and the other threads only run when the GIL is released (by other threads voluntarily, or preemptively after every so many byte-code instructions or a fixed time interval, depending on the Python version). Keeping as many threads as possible runnable would help them get their work done faster. – martineau Jul 21 '17 at 02:30
  • @martineau get_file() is the entry point of the program, there is a bunch of checking criteria and some other API calls which are required to get Auth0 access, file checking, etc. Once all this is completed get_file() is called and it loops over the API calls and gets the details of the call from the .csv file. I tried a bunch of combinations and settled on a thread count of 200... this reduced the time from some 90 hours to about 4 hours. – Sourav Dutta Jul 21 '17 at 17:58
  • OK, but I think the `thread.join()` you're doing at the end is wrong. It does nothing but wait for the last `Thread` instance created in the preceding `for` loop to end, not all of them. If you need `get_file()` to wait for all the threads created to finish, then you need to create a list of them all and wait until they're all finished. A better way to do this would probably be to use a `ThreadPool` (a rough sketch follows these comments). See this [answer of mine](https://stackoverflow.com/a/44072760/355230) for example code. – martineau Jul 21 '17 at 19:59
  • Actually, after looking at the accepted answer to [**_How to limit the number of Thread objects created?_**](https://stackoverflow.com/questions/44071684/how-to-limit-the-number-of-thread-objects-created), using [semaphores](http://effbot.org/zone/thread-synchronization.htm#semaphores) (probably a `BoundedSemaphore`) might be a better approach than what's in my answer to the question, assuming your code can be adapted to use them (a sketch of that idea also follows these comments). – martineau Jul 21 '17 at 20:33
  • Thanks @martineau shall implement this over the next few instances. Currently, however, my program seems to be running well. I do agree that thread.join() in its current form does nothing but wait for the last instance to end, which was kind of the point in this case. get_file() is a cul-de-sac and there are no other functions beyond this. I used threading here to run the multiple API requests in parallel. Once the last thread is completed the script finishes running. ThreadPool as you rightly mention is probably the right way to go here without the added complexity of semaphores. – Sourav Dutta Jul 27 '17 at 17:00
  • My point about the use of `join` in your code was that it only waits for the last thread started in the loop to end, which may not always be all of them. – martineau Jul 27 '17 at 17:20
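
For reference, here is a minimal sketch of the `ThreadPool` approach suggested above. It reuses the module-level names from the question (`f_name`, `api_url`, `site_id`, `id_type`, `authStr`, `json_extract`); `MAX_WORKERS` and `fetch_viewed_count` are illustrative names, not part of the original code:

import csv
from multiprocessing.pool import ThreadPool

import requests

MAX_WORKERS = 15  # illustrative; tune this as discussed in the answer below

def fetch_viewed_count(userid):
    headers = {
        'Accept': '*/*',
        'Authorization': authStr,
        'accept-encoding': 'gzip, deflate',
    }
    url = "{}{}/{}/{}".format(api_url, site_id, id_type, userid)
    response = requests.get(url, headers=headers)
    if response.status_code == 404:
        return 0  # user not found in the system
    return json_extract(response.json())

def get_file():
    with open(f_name) as freader:
        userids = [str(row[0]) for row in csv.reader(freader)]
    pool = ThreadPool(MAX_WORKERS)
    try:
        # map() blocks until every userid has been processed, so there is no
        # need to track or join individual Thread objects.
        return pool.map(fetch_viewed_count, userids)
    finally:
        pool.close()
        pool.join()

Here `pool.map()` does the bookkeeping that the manual `Thread`/`activeCount()` loop was doing by hand, and it waits for every row rather than only the last thread.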
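And a corresponding sketch of the `BoundedSemaphore` idea, again assuming the question's `check()` and `f_name`; `MAX_THREADS`, `thread_limiter` and `limited_check` are hypothetical names used only for illustration:

import csv
from threading import BoundedSemaphore, Thread

MAX_THREADS = 15                      # illustrative cap on concurrent worker threads
thread_limiter = BoundedSemaphore(MAX_THREADS)

def limited_check(userid, count):
    try:
        check(userid, count)
    finally:
        thread_limiter.release()      # free the slot even if the request fails

def get_file():
    threads = []
    with open(f_name) as freader:
        for count, row in enumerate(csv.reader(freader)):
            thread_limiter.acquire()  # blocks until fewer than MAX_THREADS workers are running
            t = Thread(target=limited_check, args=(str(row[0]), count))
            t.start()
            threads.append(t)
    for t in threads:                 # wait for all workers, not just the last one started
        t.join()

Acquiring the semaphore before starting each worker caps how many threads are alive at once, and joining the whole list waits for all of them.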

1 Answer


Although threads in Python do not run simultaneously due to GIL limitations (as a side note), threading can still help here: waiting for the response blocks the current thread but requires no computation, so that thread is put to sleep.
During this time, another thread can make a request.
Try to find the sweet spot for the number of threads making requests.
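
A rough, hypothetical way to look for that sweet spot, assuming a worker function like the `fetch_viewed_count` sketched in the question's comments and a `userids` list already loaded from the CSV:

import time
from multiprocessing.pool import ThreadPool

def time_batch(userids, workers):
    """Return the average seconds per request for a given pool size."""
    start = time.time()
    pool = ThreadPool(workers)
    try:
        pool.map(fetch_viewed_count, userids)
    finally:
        pool.close()
        pool.join()
    return (time.time() - start) / len(userids)

sample = userids[:1000]  # benchmark on a sample, not the full half-million rows
for workers in (15, 50, 100, 200):
    print("{} threads: {:.4f} secs/request".format(workers, time_batch(sample, workers)))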

Yuval Ben-Arie
  • Thanks. I've been doing just that. Currently running 200 threads; time per API call has dropped to "avg time per request: 0.0582283533613 secs". Initially tried 500 threads, but got errors thrown from Requests. Do you know if there is an upper limit to how many can safely be used? – Sourav Dutta Jul 21 '17 at 01:23
  • More threads don't necessarily mean better performance (in your case, completed requests). There may also be contention and other limitations. Also, the server may be limiting the number of requests per client/IP. – Yuval Ben-Arie Jul 21 '17 at 01:29