Since you're using Python 3.3, I'll recommend a Python 3-only stdlib solution: `concurrent.futures`.

This is a higher-level interface than dealing directly with `threading` or `multiprocessing` primitives. You get an `Executor` interface to handle pooling and asynchronous reporting.
The docs have an example that is basically directly applicable to your situation, so I'll just drop it here:
```python
import concurrent.futures
import urllib.request

URLS = []  # some list of urls

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
            # do json processing here
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
```
You can replace the `urllib.request` calls with `requests` calls, if you so desire. I do tend to like `requests` more, for obvious reasons.
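A `requests`-based `load_url` might look something like this (a minimal sketch; `requests` is a third-party package you'd install separately):

```python
import requests

def load_url(url, timeout):
    # requests raises its own exceptions on connection errors;
    # raise_for_status() turns HTTP 4xx/5xx responses into exceptions too,
    # so both surface later via future.result()
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content  # bytes, like conn.read() above
```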
The API goes a little bit like this: make a bunch of `Future` objects that represent the asynchronous execution of your function. You then use `concurrent.futures.as_completed` to give you an iterator over your `Future` instances. It will yield them as they are completed.
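Stripped to its core, the pattern is just this (a toy sketch with a stand-in `work` function in place of your request code):

```python
import concurrent.futures

def work(n):
    return n * n  # stand-in for your actual request function

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # submit() schedules the call and immediately returns a Future
    futures = [executor.submit(work, n) for n in range(10)]
    # as_completed() yields each Future as soon as it finishes,
    # not in submission order
    for future in concurrent.futures.as_completed(futures):
        print(future.result())
```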
As for your question:

> Also, is there a rule of thumb to figure out the optimal number of threads as a function of the number of requests, is there any?
Rule of thumb, no. It depends on too many things, including the speed of your internet connection. I will say it doesn't really depend on the number of requests you have, more on the hardware you're running on.
Fortunately it is quite easy to tweak the `max_workers` kwarg and test for yourself. Start at 5 or 10 threads and ramp up in increments of 5. You'll probably notice performance plateau at some point, then start to decrease as the overhead of adding additional threads overtakes the marginal gain of increased parallelization (which is a word).
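If you want to automate that sweep, a rough timing harness like this would do (it assumes the `load_url` and `URLS` from the example above; `fetch_all` is just a name I made up):

```python
import time
import concurrent.futures

def fetch_all(urls, workers):
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(load_url, url, 60) for url in urls]
        # wait for everything to finish; swallow individual
        # failures since we only care about timing here
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()
            except Exception:
                pass

for workers in (5, 10, 15, 20, 25):
    start = time.perf_counter()
    fetch_all(URLS, workers)
    print('%2d workers: %.2fs' % (workers, time.perf_counter() - start))
```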