A ThreadPoolExecutor inside a ProcessPoolExecutor

Question

I am new to the futures module and have a task that could benefit from parallelization; but I don't seem to be able to figure out exactly how to setup the function for a thread and the function for a process. I'd appreciate any help anyone can shed on the matter.

I'm running a particle swarm optimization (PSO). Without getting into too much detail about PSO itself, here's the basic layout of my code:

There is a Particle class, with a getFitness(self) method (which computes some metric and stores it in self.fitness). A PSO simulation has multiple particle instances (easily over 10; 100s or even 1000s for some simulations).
Every so often, I have to compute the fitness of the particles. Currently, I do this in for-loop:

for p in listOfParticles:
  p.getFitness(args)

However, I notice that the fitness of each particle can be computed independently of each other. This makes this fitness computation a prime candidate for parallelization. Indeed, I could do map(lambda p: p.getFitness(args), listOfParticles).

Now, I can easily do this with futures.ProcessPoolExecutor:

with futures.ProcessPoolExecutor() as e:
  e.map(lambda p: p.getFitness(args), listOfParticles)

Since the side-effects of calling p.getFitness are stored in each particle itself, I don't have to worry about getting a return from futures.ProcessPoolExecutor().

So far, so good. But now I notice that ProcessPoolExecutor creates new processes, which means that it copies memory, which is slow. I'd like to be able to share memory - so I should be using threads. That's well and good, until I realize that running several processes with several threads inside each process will likely be faster, since multiple threads still run only on one processor of my sweet, 8-core machine.

Here's where I run into trouble:
Based on the examples I've seen, ThreadPoolExecutor operates on a list. So does ProcessPoolExecutor. So I can't do anything iterative in ProcessPoolExecutor to farm out to ThreadPoolExecutor because then ThreadPoolExecutor is going to get a single object to work on (see my attempt, posted below).
On the other hand, I cant slice listOfParticles myself, because I want ThreadPoolExecutor to do its own magic to figure out how many threads are required.

So, the big question (at long last):
How should I structure my code so that I can effectively parallelize the following using both processes AND threads:

for p in listOfParticles:
  p.getFitness()

This is what I've been trying, but I wouldn't dare try to run it, for I know it won't work:

>>> def threadize(func, L, mw):
...     with futures.ThreadpoolExecutor(max_workers=mw) as executor:
...             for i in L:
...                     executor.submit(func, i)
... 

>>> def processize(func, L, mw):
...     with futures.ProcessPoolExecutor() as executor:
...             executor.map(lambda i: threadize(func, i, mw), L)
...

I'd appreciate any thoughts on how to fix this, or even on how to improve my approach

In case it matters, I'm on python3.3.2

What kind of code does `getFitness()` run? The problem with threads in CPython is that they're only suitable for I/O-bound tasks, because CPython has a global interpreter lock ("the GIL") that only *allows* one thread to run at a time. If, e.g., `getFitness()` runs CPU-bound Python code, the GIL will make threading run *slower* than not threading (threading just adds additional overhead for context switching then). But if, e.g., `getFitness()` runs an extension module function that releases the GIL, then threading may help (e.g., many `numpy` functions release the GIL). — Tim Peters, Nov 24 '13 at 05:03
`getFitness` interprets the information encoded in a particle as the starting semantics of a neural network, runs the neural resulting neural network, and computes the output error (this output error is the fitness - well, its inverse, really). As such, I believe that this function would be more CPU bound than I/O bound (I've done all the neural network stuff from scratch and it's all lists of classes, and multiplications thereof). So perhaps threads won't be of too much help in this exact situation, but I'd still like to be able to use a ThreadPool in a ProcessPool for applicable problems — inspectorG4dget, Nov 24 '13 at 20:30

Tim Peters · Accepted Answer · 2013-11-25T02:47:28.273

I'll give you working code that mixes processes with threads for solving the problem, but it's not what you're expecting ;-) First thing is to make a mock program that doesn't endanger your real data. Experiment with something harmless. So here's the start:

class Particle:
    def __init__(self, i):
        self.i = i
        self.fitness = None
    def getfitness(self):
        self.fitness = 2 * self.i

Now we have something to play with. Next some constants:

MAX_PROCESSES = 3
MAX_THREADS = 2 # per process
CHUNKSIZE = 100

Fiddle those to taste. CHUNKSIZE will be explained later.

The first surprise for you is what my lowest-level worker function does. That's because you're overly optimistic here:

Since the side-effects of calling p.getFitness are stored in each particle itself, I don't have to worry about getting a return from futures.ProcessPoolExecutor().

Alas, nothing done in a worker process can have any effect on the Particle instances in your main program. A worker process works on copies of Particle instances, whether via a copy-on-write implementation of fork() or because it's working on a copy made from unpickling a Particle pickle passed across processes.

So if you want your main program to see the fitness results, you need to arrange to send information back to the main program. Because I don't know enough about your actual program, here I'm assuming that Particle().i is a unique integer, and that the main program can easily map integers back to Particle instances. With that in mind, the lowest-level worker function here needs to return a pair: the unique integer and the fitness result:

def thread_worker(p):
    p.getfitness()
    return (p.i, p.fitness)

Given that, it's easy to spread a list of Particles across threads, and return a list of (particle_id, fitness) results:

def proc_worker(ps):
    import concurrent.futures as cf
    with cf.ThreadPoolExecutor(max_workers=MAX_THREADS) as e:
        result = list(e.map(thread_worker, ps))
    return result

Notes:

That's the function each worker process will run.
I'm using Python 3, so use list() to force e.map() to materialize all the results in a list.
As mentioned in a comment, under CPython spreading CPU-bound tasks across threads is slower than doing them all in a single thread.

It only remains to write code to spread a list of Particles across processes, and retrieve the results. This is dead easy to do with multiprocessing, so that's what I'm going to use. I have no idea whether concurrent.futures can do it (given that we're also mixing in threads), but don't care. But because I'm giving you working code, you can play with that and report back ;-)

if __name__ == "__main__":
    import multiprocessing

    particles = [Particle(i) for i in range(100000)]
    # Note the code below relies on that particles[i].i == i
    assert all(particles[i].i == i for i in range(len(particles)))

    pool = multiprocessing.Pool(MAX_PROCESSES)
    for result_list in pool.imap_unordered(proc_worker,
                      (particles[i: i+CHUNKSIZE]
                       for i in range(0, len(particles), CHUNKSIZE))):
        for i, fitness in result_list:
            particles[i].fitness = fitness

    pool.close()
    pool.join()

    assert all(p.fitness == 2*p.i for p in particles)

Notes:

I'm breaking the list of Particles into chunks "by hand". That's what CHUNKSIZE is for. That's because a worker process wants a list of Particles to work on, and in turn that's because that's what the futures map() function wants. It's a Good Idea to chunk up work regardless, so you get some real bang for the buck in return for the per-invocation interprocess overheads.
imap_unordered() makes no guarantees about the order in which results are returned. That gives the implementation more freedom to arrange work as efficiently as possible. And we don't care about the order here, so that's fine.
Note that the loop retrieves the (particle_id, fitness) results, and modifies the Particle instances accordingly. Perhaps your real .getfitness makes other mutations to Particle instances - can't guess. Regardless, the main program will never see any mutations made in workers "by magic" - you have to explicitly arrange for that. In the limit, you could return (particle_id, particle_instance) pairs instead, and replace the Particle instances in the main program. Then they'd reflect all mutations made in worker processes.

Have fun :-)

Futures all the way down

Turns out it was very easy to replace multiprocessing. Here are the changes. This also (as mentioned earlier) replaces the original Particle instances, so as to capture all mutations. There's a tradeoff here, though: pickling an instance requires "a lot more" bytes than pickling a single "fitness" result. More network traffic. Pick your poison ;-)

Returning the mutated instance just requires replacing the last line of thread_worker(), like so:

return (p.i, p)

Then replace all of the "main" block with this:

def update_fitness():
    import concurrent.futures as cf
    with cf.ProcessPoolExecutor(max_workers=MAX_PROCESSES) as e:
        for result_list in e.map(proc_worker,
                      (particles[i: i+CHUNKSIZE]
                       for i in range(0, len(particles), CHUNKSIZE))):
            for i, p in result_list:
                particles[i] = p

if __name__ == "__main__":
    particles = [Particle(i) for i in range(500000)]
    assert all(particles[i].i == i for i in range(len(particles)))

    update_fitness()

    assert all(particles[i].i == i for i in range(len(particles)))
    assert all(p.fitness == 2*p.i for p in particles)

The code is very similar to the multiprocessor dance. Personally, I'd use the multiprocessing version, because imap_unordered is valuable. That's a problem with simplified interfaces: they often buy simplicity at the cost of hiding useful possibilities.

You're welcome :-) See the edit just now: `multiprocessing` isn't really *needed* after all. — Tim Peters, Nov 25 '13 at 02:48
When use ProcessPoolExecutor() over ThreadPoolExecutor() or visa versa for doing parallelism? As described [over here](http://www.drdobbs.com/parallel/python-32-beta-1-introduces-new-concurre/229300338), you can even combine it using the 'Future' object :S? — Melroy van den Berg, Apr 03 '14 at 20:52
@TimPeters Is there a reason why you import concurrent.futures inside the function? — Al Guy, Aug 11 '20 at 16:51
@TimPeters May I ask you to have a look at my question? https://stackoverflow.com/questions/63306875/combining-multithreading-and-multiprocessing-with-concurrent-futures — Al Guy, Aug 11 '20 at 16:52

score 4 · Answer 2 · edited May 23 '17 at 11:44

First, are you sure to leverage from running multiple thread while loading all your cores with processes? If it is cpu-bound, hardly yes. At least some tests has to be made.

If adding threads leverage your performance, the next question is whether one can achive better performance with hand-made load balancing, or automatic. By hand-made I mean careful workload partitioning into chunks of similar computational complexity and instatiating a new task processor per chunk, your orinal but doubted solution. By automatic, creation of pool of processes/threads and communication on work queue for new tasks, that one you strive for. In my view, first approach is one of Apache Hadoop paradigm, second is implemented by works queue processors, such as Celery. First approach may suffer from some tasks chunks being slower and running while others completed, second adds commutication and waiting-on-task overheads, and this is second point of performance tests to be made.

Last, if you wish to have a static collection of processes with multithreads within, AFAIK, you can't achive it with concurrent.futures as is, and have to modify it a bit. I don't know, whether there are existing solutions for this task, but as concurrent is a pure python solution (with no C code), it can easely be done. Work processor is defined in _adjust_process_count routine of ProcessPoolExecutor class, and subclassing and overriding it with multi-threaded approach is rather straigtforward, you just have to supply your custom _process_worker, based on concurrent.features.thread

Original ProcessPoolExecutor._adjust_process_count for reference:

def _adjust_process_count(self):
    for _ in range(len(self._processes), self._max_workers):
        p = multiprocessing.Process(
                target=_process_worker,
                args=(self._call_queue,
                      self._result_queue))
        p.start()
        self._processes[p.pid] = p

I'd prefer to go with the automatic load balancing. This is because distribution, though helpful to my simulation is not of paramount importance. So, what I'm trying to do is to achieve better efficiency with minimal programming effort. But, to your first point, why am I unlikely to improve performance of a CPU bound task with several processes and several threads per process? — inspectorG4dget, Nov 15 '13 at 08:22
@inspectorG4dget it's hard to say without actual `getFitness` code inspection, CPU arch and commands used, and depends on many factors, but main reason would be CPU context switching, CPU cache misses etc. Did you manage to write a multiprocess/multithread Excecutor overriding adjust function or need more help? — alko, Nov 16 '13 at 13:55
I didn't get around to writing an over overrriding adjust function. That's a little out of my league at the moment, I fear. BUt more importantly, I'm not looking for the absolute best solution. I'm looking to get some speed up for minimal effort, so I don't mind a suboptimal solution, as long as it's still better than a single, single-threaded process — inspectorG4dget, Nov 16 '13 at 21:26

Asclepius · Answer 3 · 2019-09-25T01:12:20.923

This is a generalized answer that leverages the threadedprocess package which implements ThreadedProcesPoolExecutor, allowing a combined use of a thread pool inside a process pool. Below is a somewhat general-purpose utility function which uses it:

import concurrent.futures
import logging
from typing import Callable, Iterable, Optional

import threadedprocess

log = logging.getLogger(__name__)


def concurrently_execute(fn: Callable, fn_args: Iterable, max_processes: Optional[int] = None, max_threads_per_process: Optional[int] = None) -> None:
    """Execute the given callable concurrently using multiple threads and/or processes."""
    # Ref: https://stackoverflow.com/a/57999709/
    if max_processes == 1:
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=max_threads_per_process)
    elif max_threads_per_process == 1:
        executor = concurrent.futures.ProcessPoolExecutor(max_workers=max_processes)  # type: ignore
    else:
        executor = threadedprocess.ThreadedProcessPoolExecutor(max_processes=max_processes, max_threads=max_threads_per_process)

    if max_processes and max_threads_per_process:
        max_workers = max_processes * max_threads_per_process
        log.info("Using %s with %s processes and %s threads per process, i.e. with %s workers.", executor.__class__.__name__, max_processes, max_threads_per_process, max_workers)

    with executor:
        futures = [executor.submit(fn, *fn_args_cur) for fn_args_cur in fn_args]

    for future in concurrent.futures.as_completed(futures):
        future.result()  # Raises exception if it occurred in process worker.

A ThreadPoolExecutor inside a ProcessPoolExecutor

3 Answers3

Futures all the way down

Linked