What's the difference between python's multiprocessing and concurrent.futures?

Question

A simple way of implementing multiprocessing in python is

from multiprocessing import Pool

def calculate(number):
    return number

if __name__ == '__main__':
    pool = Pool()
    result = pool.map(calculate, range(4))

An alternative implementation based on futures is

from concurrent.futures import ProcessPoolExecutor

def calculate(number):
    return number

with ProcessPoolExecutor() as executor:
    result = executor.map(calculate, range(4))

Both alternatives do essentially the same thing, but one striking difference is that we don't have to guard the code with the usual if __name__ == '__main__' clause. Is this because the implementation of futures takes care of this, or is there a different reason?

More broadly, what are the differences between multiprocessing and concurrent.futures? When is one preferred over the other?

EDIT: My initial assumption that the guard if __name__ == '__main__' is only necessary for multiprocessing was wrong. Apparently, one needs this guard for both implementations on Windows, while it is not necessary on Unix systems.

Erm. I *doubt* that you *don't need* the `if` guard. According to [the documentation](https://docs.python.org/dev/library/concurrent.futures.html#processpoolexecutor) `ProcessPoolExecutor` is built on top of `multiprocessing`, and as such it should suffer the same problem (otherwise the `multiprocessing` documentation would show how to avoid that guard, right?). In fact the example from the documentation **does** use the usual guard. — Bakuriu, Jul 22 '14 at 19:38
You're right. I got confused since it is only necessary on windows, apparently. I must admit that I only tested the futures on mac and thus found that the guard is not necessary. I'll add some note in the question emphasizing this. — David Zwicker, Jul 22 '14 at 19:55
One time I brought down a blade server by forgetting that guard :) — JamesHutchison, Feb 02 '17 at 17:58
See also http://stackoverflow.com/questions/20776189/concurrent-futures-vs-multiprocessing-in-python-3 — max, May 06 '17 at 17:07
It looks like the prefork model on Unix saves you from that, but one should always have that 'if' line. Can anyone confirm? — James, Apr 09 '20 at 16:45

dano · Accepted Answer · 2020-05-29T13:45:05.693

You actually should use the if __name__ == "__main__" guard with ProcessPoolExecutor, too: It's using multiprocessing.Process to populate its Pool under the covers, just like multiprocessing.Pool does, so all the same caveats regarding picklability (especially on Windows), etc. apply.
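
As a minimal sketch, the futures-based snippet from the question could be guarded like this. The main() wrapper is only an illustrative name and not part of either library.

```python
from concurrent.futures import ProcessPoolExecutor


def calculate(number):
    # Defined at module level so worker processes can import and unpickle it.
    return number


def main():
    with ProcessPoolExecutor() as executor:
        # executor.map returns a lazy iterator, so collect it into a list.
        return list(executor.map(calculate, range(4)))


if __name__ == "__main__":
    # Start methods that re-import the main module (spawn is the default on
    # Windows and macOS) would otherwise re-run the pool creation in workers.
    print(main())  # [0, 1, 2, 3]
```

One small API difference shows up here: multiprocessing.Pool.map returns a list, while ProcessPoolExecutor.map returns an iterator, hence the list(...) call.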

I believe that ProcessPoolExecutor is meant to eventually replace multiprocessing.Pool, according to this statement made by Jesse Noller (a Python core contributor), when asked why Python has both APIs:

Brian and I need to work on the consolidation we intend(ed) to occur as people got comfortable with the APIs. My eventual goal is to remove anything but the basic multiprocessing.Process/Queue stuff out of MP and into concurrent.* and support threading backends for it.

For now, ProcessPoolExecutor is mostly doing the exact same thing as multiprocessing.Pool with a simpler (and more limited) API. If you can get away with using ProcessPoolExecutor, use that, because I think it's more likely to get enhancements in the long-term. Note that you can use all the helpers from multiprocessing with ProcessPoolExecutor, like Lock, Queue, Manager, etc., so needing those isn't a reason to use multiprocessing.Pool.
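
A rough sketch of combining the multiprocessing helpers with ProcessPoolExecutor follows. It assumes Manager-backed proxies, since those can be pickled and passed as task arguments, whereas a plain multiprocessing.Lock cannot be sent through submit(). The names add_to_total and main are made up for the example.

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import Manager


def add_to_total(number, shared, lock):
    # The read-modify-write on the shared dict is not atomic, so hold the lock.
    with lock:
        shared["total"] += number


def main():
    with Manager() as manager:
        shared = manager.dict(total=0)
        lock = manager.Lock()
        with ProcessPoolExecutor(max_workers=2) as executor:
            futures = [
                executor.submit(add_to_total, n, shared, lock) for n in range(4)
            ]
            for future in futures:
                future.result()  # re-raises any exception from the worker
        # Read the value while the manager process is still alive.
        return shared["total"]


if __name__ == "__main__":
    print(main())  # 0 + 1 + 2 + 3 = 6
```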

There are some notable differences in their APIs and behavior though:

If a Process in a ProcessPoolExecutor terminates abruptly, a BrokenProcessPool exception is raised, aborting any calls waiting for the pool to do work, and preventing new work from being submitted. If the same thing happens to a multiprocessing.Pool it will silently replace the process that terminated, but the work that was being done in that process will never be completed, which will likely cause the calling code to hang forever waiting for the work to finish.
If you are running Python 3.6 or lower, support for initializer/initargs is missing from ProcessPoolExecutor. Support for this was only added in 3.7.
There is no support in ProcessPoolExecutor for maxtasksperchild before Python 3.11, which added an equivalent max_tasks_per_child argument.
concurrent.futures doesn't exist in Python 2.7, unless you manually install the backport.
If you're running below Python 3.5, according to this question, multiprocessing.Pool.map outperforms ProcessPoolExecutor.map. Note that the performance difference is very small per work item, so you'll probably only notice a large performance difference if you're using map on a very large iterable. The reason for the performance difference is that multiprocessing.Pool will batch the iterable passed to map into chunks, and then pass the chunks to the worker processes, which reduces the overhead of IPC between the parent and children. ProcessPoolExecutor always (or by default, starting in 3.5) passes one item from the iterable at a time to the children, which can lead to much slower performance with large iterables, due to the increased IPC overhead. The good news is this issue is fixed in Python 3.5, as the chunksize keyword argument has been added to ProcessPoolExecutor.map, which can be used to specify a larger chunk size when you know you're dealing with large iterables. See this bug for more info.
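
Two of the points above can be sketched in a few lines, assuming Python 3.7+ for initializer/initargs. All function names here (init_worker, label, die_abruptly and so on) are hypothetical. os._exit is used only to simulate a worker that terminates abruptly.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

_prefix = None  # per-worker state, filled in by init_worker


def init_worker(prefix):
    # Runs once in each worker process when it starts (Python 3.7+).
    global _prefix
    _prefix = prefix


def label(number):
    return f"{_prefix}-{number}"


def die_abruptly():
    # os._exit skips normal interpreter cleanup, simulating a crashed worker.
    os._exit(1)


def labelled_numbers():
    with ProcessPoolExecutor(
        max_workers=2, initializer=init_worker, initargs=("job",)
    ) as executor:
        return list(executor.map(label, range(3)))


def pool_breaks_on_crash():
    with ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(die_abruptly)
        try:
            future.result()
        except BrokenProcessPool:
            # The executor is now unusable; further submit() calls also raise.
            return True
    return False


if __name__ == "__main__":
    print(labelled_numbers())      # ['job-0', 'job-1', 'job-2']
    print(pool_breaks_on_crash())  # True
```

The failure surfaces as an exception on future.result() instead of a hang, which is the behavior difference from multiprocessing.Pool described above.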

From the current [source](https://github.com/python/cpython/blob/3.7/Lib/concurrent/futures/process.py#L173) for ProcessPoolExecutor.map , using the chunksize > 1, it looks like tuples will be sent to the function so the function needs to be able to handle tuples of items rather than single items. Do you think I interpreted that correctly? — wwii, Jan 07 '19 at 18:02
@wwii The tuple returned by that function is processed by the `_process_chunk` method, which pulls each entry in the tuple out, and passes it to the mapping function the user provided. So the user doesn't have to change anything when they use a chunksize > 1. — dano, Jan 07 '19 at 19:08
@Jay Nope, both deficiencies have been addressed. `chunksize` was added to `map` in 3.5, and `initializer`/`initargs` was added in 3.7. — dano, May 29 '20 at 13:29
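
To illustrate the reply to wwii above, here is a small sketch (Python 3.5+, with the hypothetical helper describe) showing that the mapped function still receives one item per call when chunksize is greater than 1:

```python
from concurrent.futures import ProcessPoolExecutor


def describe(item):
    # Receives a single item per call, even though items travel in chunks.
    return type(item).__name__


def main():
    with ProcessPoolExecutor(max_workers=2) as executor:
        # chunksize=3 sends the six items to the workers as two batches.
        return list(executor.map(describe, range(6), chunksize=3))


if __name__ == "__main__":
    print(main())  # ['int', 'int', 'int', 'int', 'int', 'int']
```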

score 3 · Answer 2 · answered Jul 22 '14 at 19:37

if __name__ == '__main__': just means that you invoked the script on the command prompt using python <scriptname.py> [options] instead of import <scriptname> in the python shell.

When you invoke a script from the command prompt, Python sets that module's __name__ to '__main__', so the guarded code runs. In the second block, the

with ProcessPoolExecutor() as executor:
    result = executor.map(calculate, range(4))

block is executed regardless of whether it was invoked from the command prompt or imported from the shell.

Actually, one **needs** to protect the `__main__` of a `multiprocessing` script on Windows, as the main body is re-executed in the child processes. – Antti Haapala -- Слава Україні Jul 22 '14 at 19:39
Aah, in that case I misunderstood the question. – Jul 22 '14 at 19:40
