
The idea is simple: I need to send multiple HTTP requests in parallel.

I've decided to use the requests-futures library for that, which essentially spawns multiple threads.

Now, I have about 200 requests, and it's still pretty slow (it takes about 12 seconds on my laptop). I'm also using a callback to parse the response JSON, as suggested in the library documentation. Also, is there a rule of thumb for choosing the optimal number of threads as a function of the number of requests?

Basically, I was wondering if I can speed up those requests any further.
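
For concreteness, my setup looks roughly like this (a simplified sketch: the real URL list is elided, and the response-hook callback follows the pattern from the requests-futures documentation):

from concurrent.futures import as_completed
from requests_futures.sessions import FuturesSession

URLS = []  # ~200 URLs in practice

def parse_json(response, *args, **kwargs):
    # response hook: parse the body as soon as each response arrives
    response.data = response.json()

session = FuturesSession(max_workers=10)
futures = [session.get(url, hooks={'response': parse_json}) for url in URLS]

for future in as_completed(futures):
    response = future.result()
    # work with response.data here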

Nikolay Derkach
  • What version of python? Your stdlib options change pretty dramatically from 2.7 to 3.3. – roippi Nov 20 '13 at 20:57
  • I was going to suggest urllib + Threading module, but the package you linked to does essentially the same thing. As far as tuning the number of threads, I've run 25 or so without a problem on my laptop (MacBook Pro, 3.2 GHz processor, 16 GB RAM). – BenDundee Nov 20 '13 at 20:59
  • @roippi I'm using python 3.3 – Nikolay Derkach Nov 20 '13 at 21:08
  • @BenDundee how many requests were you sending and how was the performance? – Nikolay Derkach Nov 20 '13 at 21:09
  • @NikolayDerkach: I can't answer your question directly, but I have two separate examples where I use something like a threaded POST or GET. The closest to your case is using `pyelasticsearch` (which uses the requests library) to POST JSON queries against an Elasticsearch index. Threading those queries out, I typically execute ~15000 POSTs across 5 threads in ~20-30 minutes (I can't use more or I get frantic emails from the platform guy). There's a payload of a few hundred KB. Anyway, I'd find it hard to believe that `requests` is significantly more or less performant than `urllib`. – BenDundee Nov 21 '13 at 01:01
  • Don't hold me to this, but I'd guess you could get your execution time down below a minute, if all you're doing is posting. Here's a good primer on threading: http://www.ibm.com/developerworks/aix/library/au-threadingpython/ – BenDundee Nov 21 '13 at 01:03

1 Answer


Since you're using python 3.3, I'll recommend a python3-only stdlib solution: concurrent.futures.

This is a higher-level interface than dealing directly with threading or multiprocessing primitives. You get an Executor interface to handle pooling and asynchronous reporting.

The docs have an example that is basically directly applicable to your situation, so I'll just drop it here:

import concurrent.futures
import urllib.request

URLS = []  # fill in with your list of URLs

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    # the response object is a context manager; read() returns the raw bytes
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result() 
            # do json processing here
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))

You can replace urllib.request calls with requests calls, if you so desire. I do tend to like requests more, for obvious reasons.
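
For example, a drop-in replacement for load_url using requests could look like this (a sketch; raise_for_status just turns HTTP error codes into exceptions, which the as_completed loop above already catches):

import requests

def load_url(url, timeout):
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # raise on 4xx/5xx instead of returning an error page
    return resp.content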

The API goes a little bit like this: make a bunch of Future objects that represent the asynchronous execution of your function. You then use concurrent.futures.as_completed to give you an iterator over your Future instances. It will yield them as they are completed.

As for your question:

Also, is there a rule of thumb for choosing the optimal number of threads as a function of the number of requests?

Rule of thumb, no. It depends on too many things, including the speed of your internet connection. I will say it doesn't really depend on the number of requests you have, more on the hardware you're running on.

Fortunately it is quite easy to tweak the max_workers kwarg and test for yourself. Start at 5 or 10 threads and ramp up in increments of 5. You'll probably notice performance plateau at some point, then start to decrease as the overhead of adding additional threads overtakes the marginal gain of increased parallelization (which is a word).
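
Something like this rough timing harness works (a sketch that reuses load_url and URLS from the example above):

import concurrent.futures
import time

def run_batch(urls, workers):
    # fire off every request and block until the whole batch finishes
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(load_url, url, 60) for url in urls]
        concurrent.futures.wait(futures)

for workers in (5, 10, 15, 20, 25):
    start = time.time()
    run_batch(URLS, workers)
    print('%2d workers: %.2f seconds' % (workers, time.time() - start))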

roippi
  • Let me say, there _are_ limits on how many threads you can open; I've run into them before on our AWS machines, but not on my laptop. The issue is outlined here: http://www.alak.cc/2011/11/python-threaderror-cant-start-new.html – BenDundee Nov 21 '13 at 14:44
  • @roippi have you had a look at the requests-futures module I referenced in my original post? It implements pretty much the same code. – Nikolay Derkach Nov 22 '13 at 09:08
  • @NikolayDerkach no I haven't, but looking at it.. huh! It wraps basically the above into one API call, which is quite nice. The one problem with that is that if it is slow/misbehaving, you don't have any recourse for fine-tuning it. You can more easily instrument the above code when stuff goes wrong, for example. Anyway, good luck :) – roippi Nov 22 '13 at 09:27