Python multiprocessing performance only improves with the square root of the number of cores used

Question

I am attempting to implement multiprocessing in Python (Windows Server 2012) and am having trouble achieving the degree of performance improvement that I expect. In particular, for a set of tasks which are almost entirely independent, I would expect a linear improvement with additional cores.

I understand that--especially on Windows--there is overhead involved in opening new processes [1], and that many quirks of the underlying code can get in the way of a clean trend. But in theory the trend should ultimately still be close to linear for a fully parallelized task [2]; or perhaps logistic if I were dealing with a partially serial task [3].

However, when I run multiprocessing.Pool on a prime-checking test function (code below), I get a nearly perfect square-root relationship up to N_cores=36 (the number of physical cores on my server) before the expected performance hit when I get into the additional logical cores.

Here is a plot of my performance test results :
_{( "Normalized Performance" is [ a run time with 1 CPU-core ] divided by [ a run time with N CPU-cores ] ).}

Is it normal to have this dramatic diminishing of returns with multiprocessing? Or am I missing something with my implementation?

import numpy as np
from multiprocessing import Pool, cpu_count, Manager
import math as m
from functools import partial
from time import time

def check_prime(num):

    #Assert positive integer value
    if num!=m.floor(num) or num<1:
        print("Input must be a positive integer")
        return None

    #Check divisibility for all possible factors
    prime = True
    for i in range(2,num):
        if num%i==0: prime=False
    return prime

def cp_worker(num, L):
    prime = check_prime(num)
    L.append((num, prime))


def mp_primes(omag, mp=cpu_count()):
    with Manager() as manager:
        np.random.seed(0)
        numlist = np.random.randint(10**omag, 10**(omag+1), 100)

        L = manager.list()
        cp_worker_ptl = partial(cp_worker, L=L)

        try:
            pool = Pool(processes=mp)   
            list(pool.imap(cp_worker_ptl, numlist))
        except Exception as e:
            print(e)
        finally:
            pool.close() # no more tasks
            pool.join()

        return L


if __name__ == '__main__':
    rt = []
    for i in range(cpu_count()):
        t0 = time()
        mp_result = mp_primes(6, mp=i+1)
        t1 = time()
        rt.append(t1-t0)
        print("Using %i core(s), run time is %.2fs" % (i+1, rt[-1]))

Note: I am aware that for this task it would likely be more efficient to implement multithreading, but the actual script for which this one is a simplified analog is incompatible with Python multithreading due to GIL.

@Megalng tqdm generates a progress bar for any iterable; I'll remove that for clarity. — KPM, May 07 '18 at 20:14
Did you see the warning in the ``__doc__`` of imap, that it can be much slower than map? — MegaIng, May 07 '18 at 20:17
The `multiprocessing` docs should explain what they are doing better. A Manager _holds Python objects and allows other processes to manipulate them using proxies_ - meaning it is synching the list to all workers. You'd be much better off with a `multiprocessing.Array` and give each work item the index where the result should be placed. Also, `map` is better than `list(imap)` and likely `chunksize=1` will help. — tdelaney, May 07 '18 at 20:48
In fact, since you are just making a list, return the int and the result of map is already the list you want. — tdelaney, May 07 '18 at 20:49
32-bit processes (native or WOW64) support up to 32 logical cores. 64-bit Windows uses processor groups that have up to 64 logical cores per group, and a 64-bit process executes in a single group by default (a thread can be moved to another group). multiprocessing has no support for Windows processor groups, and it certainly won't actively manage threads to distribute a workload evenly across all cores. One or more processor groups may be under-utilized if the processes that end up in that group aren't as CPU intensive. — Eryk Sun, May 08 '18 at 02:17
Also. `os.cpu_count()` may be incorrect. In Python 3 it's calling `GetMaximumProcessorCount` (equivalent to POSIX `_SC_NPROCESSORS_CONF`) instead of `GetActiveProcessorCount` (equivalent to POSIX `_SC_NPROCESSORS_ONLN`). For systems that support hot-plugging CPUs, this count could be much larger than the number of actually available cores. — Eryk Sun, May 08 '18 at 02:19
If you have 36 CPUs with 72 logical cores, then it's probably split into two groups of 36 cores each. The above `os.cpu_count()` bug may report the max available if both groups were completely full, i.e. 128 cores. — Eryk Sun, May 08 '18 at 02:27
Every communication between the main worker and the additional worker gets pickled. This can be severe a bottleneck, which can even lead to negative scaling. (simple and fast function, but lots of pickling). Your example can be easier and much faster implemented using Numba or Cython which provides a parfor loop (multithreading without unessecary memory copies,but with some restrictions) — max9111, May 08 '18 at 12:01
@max9111 with all due respect, the post seems to be related more to the **scaling** per-se, than to performance ( the less efficiency of the computing ) On a closer look, there are explicit inefficiencies in the job-definition, which may well be attributed to a need to "generate" a long running calculation work-package, right due to a need to better demonstrate a resources-mapping related scaling in **[TIME]**-domain ( it would't benefit from using a more efficient computing strategy, would it? ) Last but not least, I love numba-tools, yet, lists, objects & generators 've long been unsupported — user3666197, May 08 '18 at 12:36
@user3666197 Please don't get me wrong. It is not that unlikely that the scaling in Time domain is affected due to some overhead (pickling or whatsoever). I don't get such negative scaling on my machine (4C/8T) in the approach shown above and in a Numba approach. But nevertheless the Numba approach scales better on the logical cores in the time-domain. (There is no really signifcant difference when using physical cores, but a growing difference when it comes to the logical cores) — max9111, May 08 '18 at 13:45
Note that my comments weren't directed toward the result for up to 36 cores. user3666197 addressed the primary issue there. My first comment was directed at the graph result for 36 to 72 cores. I wouldn't expect this result for a system with 36 physical CPUs and 72 logical cores, but my assumption about the system could be wrong. The second and third comment addresses the range from 72 to 128. This definitely looks like the known bug in `os.cpu_count`. If the system has 72 logical cores, the CPU count should be 72, not 128. — Eryk Sun, May 08 '18 at 13:51
I believe the current answer is lacking some critical practical aspects because the question itself is missing 1) a system description 2) system configuration (e.g. HyperThreading, Turbo Boost active) 3) the python version 4) Un-normalized performance numbers. @KellanM Please add them to the question so that there is a chance to give you a practical answer. — Zulan, May 11 '18 at 11:18
@KPM would you mind to finally click & accept the best answer proided? That is how StackOverflow used to work since ever & its fair, isn't it? — user3666197, Dec 23 '20 at 13:33

user3666197 · Answer 1 · 2019-07-22T20:40:38.390

_{@KellanM deserved [+1] for quantitative performance monitoring}

am I missing something with my implementation?

Yes, you abstract from all add-on costs of the process-management.

While you have expressed an expectation of " a linear improvement with additional cores. ", this would hardly appear in practice for several reasons ( even the hype of communism failed to deliver anything for free ).

Gene AMDAHL has formulated the inital law of diminishing returns.
A more recent, re-formulated version, took into account also the effects of process-management {setup|terminate}-add-on overhead costs and tried to cope with atomicity-of-processing ( given large workpackage payloads cannot get easily re-located / re-distributed over available pool of free CPU-cores in most common programming systems ( except some indeed specific micro-scheduling art, like the one demonstrated in Semantic Design's PARLANSE or LLNL's SISAL have shown so colourfully in past ).

A best next step?

If indeed interested in this domain, one may always experimentally measure and compare the real costs of process management ( plus data-flow costs, plus memory-allocation costs, ... up until the process-termination and results re-assembly in the main process ) so as to quantitatively fair record and evaluate the add-on costs / benefit ratio of using more CPU-cores ( that will get, in python, re-instated the whole python-interpreter state, including all its memory-state, before a first usefull operation will get carried out in a first spawned and setup process ).

Underperformance ( for the former case below )
if not disastrous effects ( from the latter case below ),
of either of ill-engineered resources-mapping policy, be it
an "under-booking"-resources from a pool of CPU-cores
or
an "over-booking"-resources from a pool of RAM-space
are discussed also here

The link to the re-formulated Amdahl's Law above will help you evaluate the point of diminishing returns, not to pay more than will ever receive.

Hoefinger et Haunschmid experiments may serve as a good practical evidence, how a growing number of processing-nodes ( be it a local O/S managed CPU-core, or a NUMA distributed architecture node ) will start decreasing the resulting performance,
where a Point of diminishing returns ( demonstrated in overhead agnostic Amdahl's Law )
will actually start to become a Point after which you pay more than receive. :

Good luck on this interesting field!

Last, but not least,

NUMA / non-locality issues get their voice heard, into the discussion of scaling for HPC-grade tuned ( in-Cache / in-RAM computing strategies ) and may - as a side-effect - help detect the flaws ( as reported by @eryksun above ). One may feel free to review one's platform actual NUMA-topology by using lstopo tool, to see the abstraction, that one's operating system is trying to work with, once scheduling the "just"-[CONCURRENT] task execution over such a NUMA-resources-topology:

I like your analysis, but ... In a forking system like linux, you don't have to re-instate the python interpreter.. and even if you did, that's only once per pool worker. — tdelaney, May 07 '18 at 20:58
Yes, @tdelaney your notice is coorect, with **a few exceptions :** (1) O/P explicitly stated Windows O/S, where all ***forking* is not supported** ( so far, 2018/Q2 ) (2) Unix / Linux O/S forking-tricks seem as **not supported in py2.7.x** (3) using additional process-to-process communication methods ( as you have proposed to use **`.Array()` will add immense add-on SER/DES overhead costs** `pickle.{dumps|loads}()`. Sure, the above proposed algorithm was not polished for HPC-grade ultimate processing efficiency ( which was not merit of the post ), **yet, no reason to add even more overheads** — user3666197, May 08 '18 at 11:53
Boldly stated. You posted a general observation and I mentioned mitigating factors (forking system overhead and the one-time cost). Not sure what the problem here is. — tdelaney, May 08 '18 at 14:57
The answer is abstract and theoretical but does not address the particular question. There is no practical evidence that suggests this particular code suffers from anything of what you describe - mainly because the question lacks essential information. I see typical run times for `cp_worker` from 600 ms to 3500 ms - just how much (de)serialization/communication overhead is there supposed to be for such an impact on performance? — Zulan, May 11 '18 at 11:29
Please don't use block quotes for emphasis. They are for marking up quotations, not for hilighting arbitrary blocks of text or drawing attention to your titles. We've had this discussion before, there are objectively right and wrong ways to use some of the formatting tools this site provides, such as block quotes. You are generally pretty free to format your posts as you like, but if somebody fixes a problem where you've incorrectly marked up your post, you are expected to leave their edits alone. — user229044, May 11 '18 at 12:46
@tdelaney - being 4 years late after your comment above (~14 months was unfairly banned for questioning censoring + other hate-promoting pathologies here ), let me react to your fair claim about fork-ed processing. While it looks great on a first look, there are hidden problems - best documented by other StackOverflow member - in a story how bad the things can turn "under the hood" if forking was relied on - https://pythonspeed.com/articles/python-multiprocessing/ I hope you can enjoy this hands-on "sharks in a Poll" story ( I did ) to avoid a whole class of adverse impacts not mentioned above — user3666197, Mar 22 '22 at 13:17
@Zulan (4 years late,ref.text above) let me address your faint point on evidence. Let me bring a *Live GUI-interactive simulator* of how much we will have to pay to start more than 1 stream of process-flow (costs vary, lower for threads, larger for MPI-based distributed operations, highest for multiprocessing-processes, as used in Python Interpreter, where N-many copies of the main Python Interpreter process get first copied (RAM-allocs+copies,O/S-sched.spawned procs+SER/xfer/DES cost).Set Ncpu=8 +move O-slider higher from 0.01% https://www.desmos.com/calculator/zfrrlfeiji Only then slide P… □ — user3666197, Mar 22 '22 at 13:29

Python multiprocessing performance only improves with the square root of the number of cores used

1 Answers1

Yes, you abstract from all add-on costs of the process-management.

A best next step?

Last, but not least,

Linked