Python core usage slower/under 100% with multiprocessing.Pool

Question

Code that runs on one core @ 100% actually runs slower when multiprocessed, where it runs on several cores @ ~50%.

This question is asked frequently, and the best threads I've found about it (0, 1) give the answer, "It's because the workload isn't heavy enough, so the inter-process communication (IPC) overhead ends up making things slower."

I don't know whether or not this is right, but I've isolated an example where this happens AND doesn't happen for the same workload, and I want to know whether this answer still applies or why it actually happens:

from multiprocessing import Pool

def f(n):
    res = 0

    for i in range(n):
        res += i**2

    return res


def single(n):
    """ Single core """
    for i in range(n):
        f(n)


def multi(n):
    """ Multi core """
    pool = Pool(2)

    for i in range(n):
        pool.apply_async(f, (n,))

    pool.close()
    pool.join()

def single_r(n):
    """ Single core, returns """
    res = 0

    for i in range(n):
        res = f(n) % 1000 # Prevent overflow

    return res


def multi_r(n):
    """ Multi core, returns """
    pool = Pool(2)
    res = 0

    for i in range(n):
        res = pool.apply_async(f, (n,)).get() % 1000

    pool.close()
    pool.join()

    return res

# Run
n = 5000

if __name__ == "__main__":
    print(f"single({n})...", end='')
    single(n)
    print(" DONE")
    print(f"multi({n})...", end='')
    multi(n)
    print(" DONE")

    print(f"single_r({n})...", end='')
    single_r(n)
    print(" DONE")
    print(f"multi_r({n})...", end='')
    multi_r(n)
    print(" DONE")

The workload is f().

f() is run single-cored and dual-cored with its return value discarded via single() and multi().

Then f() is run single-cored and dual-cored with its return value collected via single_r() and multi_r().

My result is that the slowdown happens only when f() is run multiprocessed and its return value is collected. When the return value is discarded, it doesn't happen.

So single() takes q seconds. multi() is much faster. Good. Then single_r() takes q seconds. But then multi_r() takes much more than q seconds. Visual inspection of my system monitor corroborates this (a little hard to tell, but the multi(n) hump is shaded two colors, indicating activity from two different cores).

Also, corroborating video of the terminal outputs

Even with uniform workload, is this still IPC overhead? Is such overhead only paid when other processes return their results, and, if so, is there a way to avoid it while still returning results?

I don't think this will solve your issue, but when working with multiprocessing, use the __main__ conditional to not spawn a process that executes the whole module again, I added it to your code. — fixmycode, Nov 22 '19 at 04:13
Can you explain what you mean by _I've isolated an example where this happens AND doesn't happen for the same workload_ ? What's the average runtime for the different methods? Have you used a profiler? Also, use context managers for your pools, they're great. — AMC, Nov 22 '19 at 04:35
You are using `.get()` in your function `multi_r()` too soon. `AsyncResult.get()` is blocking, so you don't run this parallel in your setup. What happens is you submit one job, then await the result and only then the for-loop continues submitting the next job. You need to schedule and store all "futures" you get from `.apply_async()` first and only then call `.get()` in a second loop after all jobs are already on the way. — Darkonaut, Nov 23 '19 at 12:19
@Darkonaut Huh, interesting! I'm a noob; could you point me to a resource for how to "store futures"? — 6equj5, Nov 25 '19 at 18:25
It's just collecting them in a list like [here](https://stackoverflow.com/a/55577761/9059420) for example. — Darkonaut, Nov 25 '19 at 18:35
@Darkonaut You're correct! It's a shame I can't find [documentation](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.AsyncResult.get) saying .get() is blocking, but this must indeed be the case: if I first collect the apply_async() results in a list and then run .get() on them sequentially in the parent process, I see the same increase as in the `multi()` function that has no returns! — 6equj5, Nov 26 '19 at 19:26
Yeah, now that you say it, the wording in the docs could surely be improved. — Darkonaut, Nov 26 '19 at 19:38

6equj5 · Accepted Answer · 2019-11-26T20:15:35.610

As Darkonaut pointed out, the slowdown when using multiple processes in multi_r() is because the get() call is blocking:

for i in range(n):
    res = pool.apply_async(f, (n,)).get() % 1000

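Because each get() waits for its task to finish before the loop submits the next one, only one task is ever in flight. The workload therefore runs sequentially, merely hopping between worker processes, while still paying the multiprocessing overhead, which makes it slower than the single-cored equivalent single_r()!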
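The blocking behavior can be observed directly. The following is a minimal sketch, not part of the original code: `demo_blocking_get` is a hypothetical helper, and `time.sleep` is used only as a stand-in workload so that the task is guaranteed to still be running when `ready()` is first checked.

```python
import time
from multiprocessing import Pool


def demo_blocking_get(delay=0.5):
    """Submit one sleeping task and report how get() behaves.

    Returns (ready_before, ready_after, elapsed_seconds).
    """
    with Pool(1) as pool:
        start = time.perf_counter()
        result = pool.apply_async(time.sleep, (delay,))  # returns immediately
        ready_before = result.ready()  # task is still sleeping in the worker
        result.get()                   # blocks until the worker has finished
        elapsed = time.perf_counter() - start
        ready_after = result.ready()
    return ready_before, ready_after, elapsed


if __name__ == "__main__":
    before, after, elapsed = demo_blocking_get()
    print(f"ready before get(): {before}")
    print(f"ready after get():  {after}")
    print(f"submit + get() took about {elapsed:.2f} s")
```

If `get()` did not block, the elapsed time would be close to zero rather than close to the sleep duration.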

Meanwhile, multi() ran faster (i.e., ran in parallel correctly) because it contains no get() calls.

To run in parallel and still return results, collect the result objects first, as in:

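def multi_r_collected(n):
    """ Multi core, collects apply_async() results before returning them """
    pool = Pool(2)

    # Collect first: submit every task without waiting on any of them
    pending = [pool.apply_async(f, (n,)) for _ in range(n)]

    pool.close()
    pool.join()  # Wait until the workers have finished all submitted tasks

    # .get() after: every result is ready by now, so these calls do not wait
    return [r.get() % 1000 for r in pending]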
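For this particular pattern (the same function applied to a list of arguments), `Pool.map()` may be a simpler option. The sketch below is an alternative to `multi_r_collected()`, not something from the original post; `multi_r_map` and its `processes` parameter are hypothetical names. It also uses the pool as a context manager, as suggested in the comments.

```python
from multiprocessing import Pool


def f(n):
    """Same workload as in the question: sum of squares below n."""
    res = 0
    for i in range(n):
        res += i**2
    return res


def multi_r_map(n, processes=2):
    """Hypothetical variant: run f(n) n times in parallel via Pool.map()."""
    with Pool(processes) as pool:
        # map() hands out all tasks, then blocks once until every result is in.
        results = pool.map(f, [n] * n)
    return [r % 1000 for r in results]


if __name__ == "__main__":
    print(multi_r_map(200)[:3])
```

`map()` returns the results in input order and submits the work in chunks, so there is no per-task `get()` to misplace. Leaving the `with` block terminates the pool, which is safe here only because `map()` has already returned every result.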

Visual inspection of CPU activity corroborates the observed speed-up: when run with 12 processes via Pool(12), there's a clean, uniform mesa of multiple cores clearly running at 100% in parallel (not the 50% mishmash of multi_r(n)).
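To put numbers on the difference instead of relying on the system monitor, the two submission patterns can be timed side by side. This is only a rough sketch: `blocking`, `collected`, and `timed` are hypothetical helper names, `n` is reduced from the question's 5000 to keep the run short, and the actual timings will depend on the machine.

```python
import time
from multiprocessing import Pool


def f(n):
    """Same workload as in the question: sum of squares below n."""
    res = 0
    for i in range(n):
        res += i**2
    return res


def blocking(pool, n):
    """get() right after each submit: only one task is in flight at a time."""
    return [pool.apply_async(f, (n,)).get() % 1000 for _ in range(n)]


def collected(pool, n):
    """Submit everything first, then call get() on each result."""
    pending = [pool.apply_async(f, (n,)) for _ in range(n)]
    return [r.get() % 1000 for r in pending]


def timed(func, pool, n):
    """Return (elapsed_seconds, results) for one call of func(pool, n)."""
    start = time.perf_counter()
    results = func(pool, n)
    return time.perf_counter() - start, results


if __name__ == "__main__":
    n = 2000  # assumption: smaller than the question's 5000 for a quick run
    with Pool(2) as pool:
        for func in (blocking, collected):
            elapsed, _ = timed(func, pool, n)
            print(f"{func.__name__:>9}: {elapsed:.2f} s")
```

Both functions share one pool, so the cost of starting the worker processes is excluded from the comparison.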
