72

I spent a whole day looking for the simplest possible multithreaded URL fetcher in Python, but most of the scripts I found use queues, multiprocessing, or complex libraries.

Finally I wrote one myself, which I am posting below as an answer. Please feel free to suggest any improvements.

I guess other people might have been looking for something similar.

martineau
Daniele B
  • Just to add: in Python's case, multithreading doesn't give true parallelism across cores, due to the GIL. – akshayb Apr 24 '13 at 18:38
  • It still looks like fetching the URLs in parallel is faster than doing it serially. Why is that? Is it due to the fact that (I assume) the Python interpreter is not running continuously during an HTTP request? – Daniele B Apr 25 '13 at 01:01
  • What if I want to parse the content of the web pages I fetch? Is it better to do the parsing within each thread, or should I do it sequentially after joining the worker threads to the main thread? – Daniele B Apr 25 '13 at 01:02

5 Answers

54

Simplifying your original version as far as possible:

import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

def fetch_url(url):
    urlHandler = urllib2.urlopen(url)
    html = urlHandler.read()
    print "'%s\' fetched in %ss" % (url, (time.time() - start))

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in urls]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print "Elapsed Time: %s" % (time.time() - start)

The only new tricks here are:

  • Keep track of the threads you create.
  • Don't bother with a counter of threads if you just want to know when they're all done; join already tells you that.
  • If you don't need any state or external API, you don't need a Thread subclass, just a target function.
abarnert
  • I made sure to claim that this was simplified "as far as possible", because that's the best way to make sure someone clever comes along and finds a way to simplify it even further just to make me look silly. :) – abarnert Apr 24 '13 at 01:51
  • I believe it's not easy to beat that! :-) It's a great improvement over the first version I published here. – Daniele B Apr 24 '13 at 01:56
  • Maybe we can combine the first 2 loops into one, by instantiating and starting the threads in the same `for` loop? – Daniele B Apr 24 '13 at 03:40
  • @DanieleB: Well, then you have to change the list comprehension into an explicit loop around `append`, like [this](http://pastebin.com/eYe7MCKn). Or, alternatively, write a wrapper which creates, starts, and returns a thread, like [this](http://pastebin.com/pVLSiNW2); a sketch of that wrapper follows after these comments. Either way, I think it's less simple (although the second one is a useful way to refactor complicated cases, it doesn't help when things are already simple). – abarnert Apr 24 '13 at 18:05
  • @DanieleB: In a different language, however, you could do that. If `thread.start()` returned the thread, you could put the creation and start together into a single expression. In C++ or JavaScript, you'd probably do that. The problem is that, while method chaining and other "fluent programming" techniques make things more concise, they can also break down the expression/statement boundary and are often ambiguous, so Python goes in almost the exact opposite direction, and almost _no_ methods or operators return the object they operate on. See http://en.wikipedia.org/wiki/Fluent_interface. – abarnert Apr 24 '13 at 18:07
  • I'm new to Python and am wondering: how do you limit this to, for example, 8 workers at a time? – Rachelle Uy Mar 09 '14 at 07:08
  • @RachelleUy: you could use a thread pool as shown in [my answer (pass 8 instead of 20)](http://stackoverflow.com/a/27986480/4279) – jfs Dec 02 '16 at 14:15
  • This answer is amazing! I used this in a project I am developing for fun, which captures the Dividend Yield history from all shares of the stock market. – Victor Apr 24 '20 at 07:10
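As a footnote to the comment thread above: a minimal sketch of the wrapper abarnert describes, in case the pastebin links rot. `start_thread` is a hypothetical name, and `fetch_url` and `urls` are reused from the answer's code:

import threading

def start_thread(*args, **kwargs):
    # Hypothetical helper: create a thread, start it, and return it,
    # so creation and start fit into a single list comprehension.
    thread = threading.Thread(*args, **kwargs)
    thread.start()
    return thread

threads = [start_thread(target=fetch_url, args=(url,)) for url in urls]
for thread in threads:
    thread.join()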
44

multiprocessing has a thread pool that doesn't start other processes:

#!/usr/bin/env python
from multiprocessing.pool import ThreadPool
from time import time as timer
from urllib2 import urlopen

urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

def fetch_url(url):
    try:
        response = urlopen(url)
        return url, response.read(), None
    except Exception as e:
        return url, None, e

start = timer()
results = ThreadPool(20).imap_unordered(fetch_url, urls)
for url, html, error in results:
    if error is None:
        print("%r fetched in %ss" % (url, timer() - start))
    else:
        print("error fetching %r: %s" % (url, error))
print("Elapsed Time: %s" % (timer() - start,))

The advantages compared to the Thread-based solution:

  • ThreadPool allows you to limit the maximum number of concurrent connections (20 in the code example)
  • the output is not garbled, because all printing happens in the main thread
  • errors are caught and reported
  • the code works on both Python 2 and 3 with a one-line change (use from urllib.request import urlopen on Python 3; see the compatibility sketch after this list).
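A tiny sketch of that last point: with the usual try/except import dance, the snippet above runs unmodified under either version:

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2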
jfs
  • I have a question regarding the code: does the print in the fourth line from the bottom really report the time it took to fetch the url, or the time it takes to return the url from the `results` object? In my understanding the timestamp should be printed in the fetch_url() function, not in the result-printing part. – Uwe Ziegenhagen Jul 09 '16 at 11:41
  • @UweZiegenhagen `imap_unordered()` returns each result as soon as it is ready. I assume the overhead is negligible compared to the time it takes to make the HTTP request. – jfs Jul 09 '16 at 13:43
  • Thank you, I am using it in a modified form to compile LaTeX files in parallel: http://uweziegenhagen.de/?p=3501 – Uwe Ziegenhagen Jul 09 '16 at 16:41
  • This is by far the best, fastest and simplest way to go. I have been trying twisted, scrapy and others, using both Python 2 and Python 3, and this is simpler and better. – UriCS Nov 07 '16 at 09:04
  • Thanks! Is there a way to add a delay between the calls? – stallingOne May 17 '17 at 09:32
  • @stallingone: yes. You could [use something like RatedSemaphore](https://stackoverflow.com/a/16686329/4279) – jfs Jun 15 '17 at 22:11
  • After getting the `results`, should we call `.join()` or `.terminate()` to terminate the processes? Or do we not need to do that for `ThreadPool`? – chengcj Oct 07 '17 at 07:09
  • @chengcj: `ThreadPool()` does not start new processes (as the answer says explicitly). Usually, you create a pool and it lives as long as your Python script lives (as in the example). You shouldn't create and tear down the pool constantly in a loop; the whole point of using a pool is to maintain a pool of threads/processes that are ready to do the work, without starting/shutting them down repeatedly. In other words, do nothing; but if you know you need it, you could call `.terminate()` or use a `with`-statement: `with Pool() as pool: ...` (a sketch of the `with` form follows after these comments). – jfs Oct 07 '17 at 07:39
  • Is this a scalable solution? I mean, if multiple users do this, doesn't it break the database server? (microservices) – Quantum Dreamer Aug 26 '19 at 23:19
  • @JohnRuby the code allows you to make 20 concurrent web requests. If your purpose is to harm a server, there are much more effective solutions. – jfs Sep 06 '19 at 18:44
  • @jfs I am talking in a big-data context. To load a report of 100,000 rows, usually the data will be paginated. However, sending parallel requests will create multiple connections to the database and ......... – Quantum Dreamer Mar 25 '20 at 03:08
  • @QuantumDreamer: a single query may break one service, while another service may accept millions of concurrent users. It depends on the service; it has nothing to do with the answer. It is up to you to know what and how many requests your service can/should accept. – jfs Mar 25 '20 at 16:51
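To illustrate the pool-lifetime point from the comments, a minimal sketch of the `with`-statement form; this assumes Python 3.3+, where pool objects are context managers whose exit calls `terminate()`:

#!/usr/bin/env python3
from multiprocessing.pool import ThreadPool
from urllib.request import urlopen

urls = ["http://www.google.com", "http://www.apple.com"]

def fetch_url(url):
    try:
        return url, urlopen(url).read(), None
    except Exception as e:
        return url, None, e

# The pool is terminated automatically when the with-block exits.
with ThreadPool(20) as pool:
    for url, html, error in pool.imap_unordered(fetch_url, urls):
        if error is None:
            print("%r fetched" % url)
        else:
            print("error fetching %r: %s" % (url, error))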
21

The main example in the concurrent.futures documentation does everything you want, a lot more simply. Plus, it can handle huge numbers of URLs by doing only 5 at a time, and it handles errors much more nicely.

Of course this module is only built in with Python 3.2 or later… but if you're using 2.5-3.1, you can just install the backport, futures, off PyPI. All you need to change from the example code is to search-and-replace concurrent.futures with futures, and, for 2.x, urllib.request with urllib2.
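If you'd rather not search-and-replace, a sketch of a compatibility import should also work (assuming the backport is installed under its original top-level name, futures):

try:
    import concurrent.futures as futures  # Python 3.2+ standard library
except ImportError:
    import futures  # pip install futures (the 2.5-3.1 backport)

Either way, `futures.ThreadPoolExecutor` and `futures.as_completed` are then available under one name.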

Here's the sample backported to 2.x, modified to use your URL list and to add the times:

import concurrent.futures
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    conn = urllib2.urlopen(url, timeout=timeout)
    return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print '%r generated an exception: %s' % (url, exc)
        else:
            print '"%s" fetched in %ss' % (url,(time.time() - start))
print "Elapsed Time: %ss" % (time.time() - start)

But you can make this even simpler. Really, all you need is:

def load_url(url):
    conn = urllib2.urlopen(url, timeout=60)
    data = conn.read()
    print '"%s" fetched in %ss' % (url, (time.time() - start))
    return data
    
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    pages = executor.map(load_url, urls)

print "Elapsed Time: %ss" % (time.time() - start)
MestreLion
abarnert
2

I am now publishing a different solution: the worker threads are non-daemon and are joined to the main thread (which blocks the main thread until all worker threads have finished), instead of notifying the end of each worker thread's execution with a callback to a global function (as I did in the previous answer), since it was noted in some comments that that way is not thread-safe.

import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]

class FetchUrl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.url = url

    def run(self):
        urlHandler = urllib2.urlopen(self.url)
        html = urlHandler.read()
        print "'%s\' fetched in %ss" % (self.url,(time.time() - start))

for url in urls:
    FetchUrl(url).start()

# Join all existing threads to the main thread.
for thread in threading.enumerate():
    if thread is not threading.currentThread():
        thread.join()

print "Elapsed Time: %s" % (time.time() - start)
Daniele B
  • This will work, but it isn't the way you want to do it. If a later version of your program creates any other threads (daemon, or joined by some other code), it will break. Also, `thread is threading.currentThread()` isn't guaranteed to work (I think it always will for any CPython version so far, on any platform with real threads, if used in the main thread… but still, better not to assume). Safer to store all the `Thread` objects in a list (`threads = [FetchUrl(url) for url in urls]`), then start them, then join them with `for thread in threads: thread.join()`. – abarnert Apr 24 '13 at 01:19
  • Also, for simple cases like this, you can simplify it even further: don't bother creating a `Thread` subclass unless you have some kind of state to store or some API to interact with the threads from outside; just write a simple function and do `threading.Thread(target=my_thread_function, args=[url])`. – abarnert Apr 24 '13 at 01:22
  • Do you mean that if I have the same script running twice at the same time on the same machine, `for thread in threading.enumerate():` would include the threads of both executions? – Daniele B Apr 24 '13 at 01:25
  • See http://pastebin.com/Z5MdeB5x, which I think is about as simple as you're going to get for an explicit-threaded URL-fetcher. – abarnert Apr 24 '13 at 01:25
  • `threading.enumerate()` only includes the threads in the current process, so running multiple copies of the same script as separate Python processes isn't a problem. It's just that if you later decide to expand on this code (or use it in some other project), you may have daemon threads created in another part of the code, or what's now the main code may even be code running in some background thread. – abarnert Apr 24 '13 at 01:27
  • Cool, the `threading.enumerate()` explanation makes sense to me! Thanks a lot for the code [pastebin.com/Z5MdeB5x](http://pastebin.com/Z5MdeB5x); if you paste it into a new answer, I will accept it as the top answer! – Daniele B Apr 24 '13 at 01:32
-1

This script fetches the content from a set of URLs defined in an array. It spawns a thread for each URL to be fetched, so it is meant to be used for a limited set of URLs.

Instead of using a queue object, each thread notifies the end of its work with a callback to a global function, which keeps count of the number of threads still running.

import threading
import urllib2
import time

start = time.time()
urls = ["http://www.google.com", "http://www.apple.com", "http://www.microsoft.com", "http://www.amazon.com", "http://www.facebook.com"]
left_to_fetch = len(urls)

class FetchUrl(threading.Thread):
    def __init__(self, url):
        threading.Thread.__init__(self)
        self.daemon = True
        self.url = url

    def run(self):
        urlHandler = urllib2.urlopen(self.url)
        html = urlHandler.read()
        finished_fetch_url(self.url)


def finished_fetch_url(url):
    "callback function called when a FetchUrl thread ends"
    print "\"%s\" fetched in %ss" % (url,(time.time() - start))
    global left_to_fetch
    left_to_fetch -= 1
    if left_to_fetch == 0:
        # all urls have been fetched
        print "Elapsed Time: %ss" % (time.time() - start)


for url in urls:
    "spawning a FetchUrl thread for each url to fetch"
    FetchUrl(url).start()
Daniele B
  • I can see this being extremely useful! Thanks :) – Jason Sperske Apr 24 '13 at 00:00
  • It isn't thread-safe to modify shared globals without a lock. And it's _especially_ dangerous to do things like `urlsToFetch-=1`. Inside the interpreter, that compiles into three separate steps: load `urlsToFetch`, subtract one, and store `urlsToFetch`. If the interpreter switches threads between the load and the store, you'll end up with thread 1 loading a 2, then thread 2 loading the same 2, then thread 2 storing a 1, then thread 1 storing a 1. – abarnert Apr 24 '13 at 00:13
  • Hi abarnert, thanks for your answer. Can you please suggest a thread-safe solution? Many thanks – Daniele B Apr 24 '13 at 00:17
  • You can put a `threading.Lock` around every access to the variable, or use lots of other possibilities (a counted semaphore instead of a plain integer, or a barrier instead of counting explicitly, …), but you really don't need this global at all. Just `join` all the threads instead of daemonizing them, and it's done when you've joined them all. (A sketch of the lock-based variant follows after these comments.) – abarnert Apr 24 '13 at 00:20
  • In fact… daemonizing the threads like this and then not waiting on anything means your program quits, terminating all of the worker threads, before most of them can finish. On a fastish MacBook Pro with a slowish network connection, I often don't get _any_ finished before it quits. – abarnert Apr 24 '13 at 00:21
  • And all of these fiddly details that are very easy to get disastrously wrong and hard to get right are exactly why you're better off using higher-level APIs like `concurrent.futures` whenever possible. – abarnert Apr 24 '13 at 00:23
  • @abarnert I am wondering: in terms of CPU usage, is `thread.join()` equivalent to a while-loop? – Daniele B Apr 24 '13 at 00:52
  • No, not at all. `thread.join` is a blocking call—it waits without using any CPU until the OS tells it to wake up because `thread` has finished. – abarnert Apr 24 '13 at 00:57
  • Side note: You can use 'single quotes' instead of "double quotes" for strings, so you don't have to escape literal quote characters: `'"%s" fetched'` instead of `"\"%s\" fetched"`. (And if you need both single and double quotes in the same string, just use """triple double quotes""".) – abarnert Apr 24 '13 at 01:03
  • I just published a different answer with thread.join(). Any better? – Daniele B Apr 24 '13 at 01:04
  • Thanks for the explanation about quoting; it seems the same as in MySQL. – Daniele B Apr 24 '13 at 01:09
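For completeness, a sketch of the lock-based fix abarnert suggests in the comments, applied to this answer's finished_fetch_url callback (the rest of the script is unchanged; `counter_lock` is a new module-level name):

import threading

counter_lock = threading.Lock()

def finished_fetch_url(url):
    "callback function called when a FetchUrl thread ends"
    print "\"%s\" fetched in %ss" % (url, (time.time() - start))
    global left_to_fetch
    # Hold the lock across the read-modify-write so two threads can't
    # both load the same value and store the same decremented result.
    with counter_lock:
        left_to_fetch -= 1
        done = (left_to_fetch == 0)
    if done:
        print "Elapsed Time: %ss" % (time.time() - start)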