Python multiprocessing - Why is using functools.partial slower than default arguments?

Question

Consider the following function:

def f(x, dummy=list(range(10000000))):
    return x

If I use multiprocessing.Pool.imap, I get the following timings:

import time
import os
from multiprocessing import Pool

def f(x, dummy=list(range(10000000))):
    return x

start = time.time()
pool = Pool(2)
for x in pool.imap(f, range(10)):
    print("parent process, x=%s, elapsed=%s" % (x, int(time.time() - start)))

parent process, x=0, elapsed=0
parent process, x=1, elapsed=0
parent process, x=2, elapsed=0
parent process, x=3, elapsed=0
parent process, x=4, elapsed=0
parent process, x=5, elapsed=0
parent process, x=6, elapsed=0
parent process, x=7, elapsed=0
parent process, x=8, elapsed=0
parent process, x=9, elapsed=0

Now if I use functools.partial instead of using a default value:

import time
import os
from multiprocessing import Pool
from functools import partial

def f(x, dummy):
    return x

start = time.time()
g = partial(f, dummy=list(range(10000000)))
pool = Pool(2)
for x in pool.imap(g, range(10)):
    print("parent process, x=%s, elapsed=%s" % (x, int(time.time() - start)))

parent process, x=0, elapsed=1
parent process, x=1, elapsed=2
parent process, x=2, elapsed=5
parent process, x=3, elapsed=7
parent process, x=4, elapsed=8
parent process, x=5, elapsed=9
parent process, x=6, elapsed=10
parent process, x=7, elapsed=10
parent process, x=8, elapsed=11
parent process, x=9, elapsed=11

Why is the version using functools.partial so much slower?

Why are you using `list(range(...))`? AFAIK your code would do exactly the same thing without the call to `list`, except that the problem explained by ShadowRanger wouldn't occur and the overhead of pickling would be *much much* smaller. — Bakuriu, Jan 28 '16 at 13:32
Side-note: Using `list`s (or any other mutable type) as default (or `partial` bound) arguments is dangerous, since the _same_ `list` is shared between all default invocations of the function, not a fresh copy for each call; usually, you want the fresh copy. — ShadowRanger, Jan 28 '16 at 13:33
As a side note, it is usually a bad idea to use a mutable object as a default value, because if you modify it inside the function, every subsequent call to the function will see the changes. — Copperfield, Jan 28 '16 at 13:33
@Bakuriu: I think this is just a minimal example to demonstrate the discrepancy, not real code. Which is appreciated; getting a giant dump of someone's project and no indication that they've attempted to suss out the problem is a royal PITA. — ShadowRanger, Jan 28 '16 at 13:34
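
The shared-default pitfall raised in the side notes above can be seen in a short sketch. The function names `append_item` and `append_item_fresh` are hypothetical and only serve to illustrate the point:

```python
def append_item(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is reused by every call that omits `bucket`.
    bucket.append(item)
    return bucket


def append_item_fresh(item, bucket=None):
    # Common idiom: use None as a sentinel and build a new list per call.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first = append_item(1)
second = append_item(2)
print(second)            # [1, 2]
print(first is second)   # True: both calls returned the same list

print(append_item_fresh(1))  # [1]
print(append_item_fresh(2))  # [2]
```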

ShadowRanger · Accepted Answer · 2016-01-28T20:25:03.363

Using multiprocessing requires sending the worker processes information about the function to run, not just the arguments to pass. That information is pickled in the main process, sent to the worker process, and unpickled there.

This leads to the primary issue:

Pickling a function with default arguments is cheap; it only pickles the name of the function (plus the info to let Python know it's a function); the worker processes just look up the local copy of the name. They already have a named function f to find, so it costs almost nothing to pass it.

But pickling a partial function involves pickling the underlying function (cheap) and all the bound arguments (expensive when the bound argument is a 10M long list). So every time a task is dispatched in the partial case, the parent pickles the bound argument and sends it to the worker process, which unpickles it and only then does the "real" work. On my machine, that pickle is roughly 50 MB in size, which is a huge amount of overhead; in quick timing tests on my machine, pickling and unpickling a 10 million long list of ints takes about 620 ms (and that's ignoring the overhead of actually transferring the 50 MB of data).
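
To make the size difference concrete, here is a minimal sketch that pickles both kinds of callable directly with `pickle`, which is roughly the work the pool has to do for each task. It assumes a 100,000-element list instead of the 10M one from the question so that it runs quickly, and it gives `dummy` a `None` default purely for illustration. Exact byte counts depend on the pickle protocol, so none are shown.

```python
import pickle
from functools import partial


def f(x, dummy=None):
    return x


# Assumption: a much smaller list than the question's, to keep the demo fast.
big = list(range(100_000))

# A plain module-level function is pickled by reference (module + name).
by_name = pickle.dumps(f)

# A partial is pickled by value: the wrapped function plus every bound argument.
by_value = pickle.dumps(partial(f, dummy=big))

print("plain function:", len(by_name), "bytes")
print("partial with bound list:", len(by_value), "bytes")
```

The first number should be a few dozen bytes, while the second grows with the length of the bound list.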

partials have to pickle this way because they don't know their own names. When pickling a function like f, f (being def-ed) knows its qualified name (in an interactive interpreter or from the main module of a program, it's __main__.f), so the remote side can just recreate it locally by doing the equivalent of from __main__ import f. But the partial doesn't know its name; sure, you assigned it to g, but neither pickle nor the partial itself knows that it is available under the qualified name __main__.g; it could be named foo.fred or a million other things. So it has to pickle the info necessary to recreate it entirely from scratch. It's also pickled for each call (not just once per worker) because the pool doesn't know that the callable isn't changing in the parent between work items, and it's always trying to ensure it sends up-to-date state.
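
The by-name versus by-value distinction can be checked with a pickle round trip in a single process. This is only a sketch; the small bound list and the `None` default for `dummy` are assumptions made for illustration:

```python
import pickle
from functools import partial


def f(x, dummy=None):
    return x


g = partial(f, dummy=[1, 2, 3])

f_copy = pickle.loads(pickle.dumps(f))
g_copy = pickle.loads(pickle.dumps(g))

print(f_copy is f)       # True: f was looked up again by its qualified name
print(g_copy is g)       # False: a new partial was rebuilt from its parts
print(g_copy.func is f)  # True: the wrapped function is still found by name
print(g_copy.keywords)   # {'dummy': [1, 2, 3]}: the bound list travelled by value
```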

You have other issues (timing creation of the list only in the partial case and the minor overhead of calling a partial wrapped function vs. calling the function directly), but those are chump change relative to the per-call overhead that pickling and unpickling the partial adds (the initial creation of the list adds a one-time overhead of a little under half what each pickle/unpickle cycle costs; the overhead to call through the partial is less than a microsecond).

1) If the default argument `dummy` is not pickled, then how is it sent to the worker? It is not a global variable, is it? 2) With the `partial`, each function call is expensive. Does it mean that `g` gets (re)pickled for each function call? — usual me, Jan 30 '16 at 06:13
@usualme: #1: On Linux, the workers are forked from the parent, so they already have their own copy of the function in their own memory space (it's copy-on-write, so they may actually be sharing pages with the parent for a while). And their copy already has the same default argument initialized, so when they look up the same function by qualified name, it comes already set up. On Windows, Python simulates fork by importing the `__main__` module in each worker without running it as if it were the main module; if the function is defined or imported in `__main__`, the cost to make the list is paid once per worker, not once per task. — ShadowRanger, Jan 30 '16 at 06:21
@usualme: #2: Yup, the `Pool` is generic, and there is no guarantee that worker processes won't die and be replaced, that the process of launching and receiving results from the workers won't mutate the callable passed to `imap`, that any given worker has even received work yet, or that other tasks using different callables might not be interspersed, etc. So both callable and arguments are serialized for dispatch on every individual task, not just once per worker. Usually, callables are fairly cheap to serialize; this is one of those pathological cases that's the exception to the general rule. — ShadowRanger, Jan 30 '16 at 06:23
