
I am running on a machine with two AMD 7302 16-core processors (32 cores in total). I'm on a Red Hat 8.4 system and using Python 3.10.6.

I've recently started learning the multiprocessing library. Inspired by the first example on the documentation page, I wrote my own little code:

from multiprocessing import Pool
import numpy as np
import sys
import datetime

def f(x):
    return x**2

def main(DataType="List", NThr=2, Vectorize=False):
    N = 5*10**7           # number of elements
    n = NThr              # number of threads
    y = np.zeros(N)
    # Use list
    if(DataType == "List"):
        x = []
        for i in range(N):
            x.append(i)
    # Use Numpy
    elif(DataType=="Numpy"):
        x = np.zeros(N)
        for i in range(len(x)):
            x[i] = i
    # Run parallel code
    t0 = datetime.datetime.now()
    if(n==1):
        if(DataType == "Numpy" and Vectorize == True):
            y = np.vectorize(f)(x)
        else:
            for i in range(len(x)):
                y[i] = f(x[i])
    else:
        with Pool(n) as p:
            y = p.map(f, x)
    t1 = datetime.datetime.now()
    dt = (t1 - t0).total_seconds()
    print("{} : Vect = {}, n = {}, time : {}s".format(DataType,Vectorize,n,dt))
    sys.exit(0)

if __name__ == "__main__":
    main()

I noticed that running p.map() over a numpy array performs substantially worse than running it over a list. Here is the output from several runs (python mycode.py) after twiddling the args to main:

Numpy : Vect = True, n = 1, time : 9.566441s
Numpy : Vect = False, n = 1, time : 16.00333s
Numpy : Vect = False, n = 2, time : 143.331352s
List : Vect = False, n = 1, time : 21.11657s
List : Vect = False, n = 2, time : 11.868897s
List : Vect = False, n = 5, time : 6.162561s

Look at the (Numpy, n=2) run at 143s. Its run time is substantially worse than the (List, n=2) run at 11.9s, and much worse than either of the (Numpy, n=1) runs.

Question:

What makes numpy arrays take so long to run with the multiprocessing library, specifically when NThr==2?

EDIT:

Per a comment's suggestion, I ran both versions (Numpy, n=2) and (List, n=2) through the profiler:

>>> import cProfile
>>> from mycode import main
>>> cProfile.run('main()')

and compared them side by side. The most time-consuming function calls, along with the calls whose counts differ between the two runs, are listed below.

For the Numpy version:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
# Time consuming
1    0.000    0.000  138.997  138.997 pool.py:362(map)
1    0.000    0.000  138.956  138.956 pool.py:764(wait)
1    0.000    0.000  138.956  138.956 pool.py:767(get)
4    0.000    0.000  138.957   34.739 threading.py:288(wait)
4    0.000    0.000  138.957   34.739 threading.py:589(wait)
14/1    0.000    0.000  145.150  145.150 {built-in method builtins.exec}
19  138.957    7.314  138.957    7.314 {method 'acquire' of '_thread.lock' objects}
# Different number of calls
6    0.000    0.000    0.088    0.015 popen_fork.py:24(poll)
1    0.000    0.000    0.088    0.088 popen_fork.py:36(wait)
1    0.000    0.000    0.088    0.088 process.py:142(join)
10    0.000    0.000    0.000    0.000 process.py:99(_check_closed)
18    0.000    0.000    0.000    0.000 util.py:48(debug)
76    0.000    0.000    0.000    0.000 {built-in method builtins.len}
2    0.000    0.000    0.000    0.000 {built-in method numpy.zeros}
17    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
6    0.088    0.015    0.088    0.015 {built-in method posix.waitpid}
3    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}

For the List version:

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
# Time consuming
1    0.000    0.000   13.961   13.961 pool.py:362(map)
1    0.000    0.000   13.920   13.920 pool.py:764(wait)
1    0.000    0.000   13.920   13.920 pool.py:767(get)
4    0.000    0.000   13.921    3.480 threading.py:288(wait)
4    0.000    0.000   13.921    3.480 threading.py:589(wait)
14/1    0.000    0.000   24.475   24.475 {built-in method builtins.exec}
19   13.921    0.733   13.921    0.733 {method 'acquire' of '_thread.lock' objects}
# Different number of calls
7    0.000    0.000    0.132    0.019 popen_fork.py:24(poll)
2    0.000    0.000    0.132    0.066 popen_fork.py:36(wait)
2    0.000    0.000    0.132    0.066 process.py:142(join)
12    0.000    0.000    0.000    0.000 process.py:99(_check_closed)
19    0.000    0.000    0.000    0.000 util.py:48(debug)
75    0.000    0.000    0.000    0.000 {built-in method builtins.len}
1    0.000    0.000    0.000    0.000 {built-in method numpy.zeros}
18    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
7    0.132    0.019    0.132    0.019 {built-in method posix.waitpid}
50000003    2.780    0.000    2.780    0.000 {method 'append' of 'list' objects}

Note that the List version makes 50000003 calls to append(), compared to 3 calls in the Numpy version, due to the initialization of x.

  • Does this answer your question? [Python multiprocessing and shared numpy array](https://stackoverflow.com/questions/35867650/python-multiprocessing-and-shared-numpy-array) – Random Davis Dec 06 '22 at 18:18
  • It does not, I'm asking specifically about the run time using a numpy array with and without the multiprocessing library. I'm comparing this to how it runs with a list. Modified question to be more clear about this fact – irritable_phd_syndrome Dec 06 '22 at 18:21
  • They explained in that post that serialization was a significant factor; I'm not sure that built-in lists are as expensive to serialize. Regardless, have you [profiled](https://stackoverflow.com/questions/582336/how-do-i-profile-a-python-script) your script at all to see exactly where in the code that the slowdown really is, objectively? – Random Davis Dec 06 '22 at 18:24
  • As an aside, ` # number of threads` it is not the number of threads, it is the number of processes – juanpa.arrivillaga Dec 06 '22 at 18:40
  • Another aside, in the case of `def f(x): return x**2` it is very important to understand that `f(x)` is **much faster** than `np.vectorize(f)(x)` for a `numpy.ndarray`. Why are you using `np.vectorize` at all? – juanpa.arrivillaga Dec 06 '22 at 19:08
  • I'm not looking for the fastest way of computing x**2. I was trying to get a handle on why using a numpy array with `p.map` is so much slower than either `p.map` with a list, or any of the serial options (e.g. `np.vectorize` or looping through the numpy array directly). Also, `np.vectorize` was a bit faster than just the straight loop over `x` with `n=1` (i.e. 9.6s vs 16s) – irritable_phd_syndrome Dec 06 '22 at 19:20
  • @Random Davis, just added profiling information – irritable_phd_syndrome Dec 07 '22 at 13:11
  • I would characterize every part of this as being incorrectly written code. There is nothing here that is implemented correctly. The underlying question you're asking is almost completely irrelevant as a result. – CJR Dec 07 '22 at 13:27
  • Your timings consist almost entirely of overhead and very little useful work - most of what you're timing is serialization/deserialization, inter-process communication, and the substantial overhead incurred by working with NumPy arrays on an element-by-element basis instead of with array-based operations. Very little of the runtime is spent actually squaring numbers. – user2357112 Dec 07 '22 at 13:30
  • (Note that `numpy.vectorize` falls under "working with NumPy arrays on an element-by-element basis" and not under "array-based operations". Quoting the [docs](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html), "The `vectorize` function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.") – user2357112 Dec 07 '22 at 13:33
  • @user2357112 - Not quite. When n=2, whether x is a list or a numpy array, it gets passed to `y = p.map(f, x)`. How would you propose doing parallel operations on a numpy array using the multiprocessing library? Clearly running it through `p.map` is much slower than serially. – irritable_phd_syndrome Dec 07 '22 at 13:33
  • @irritable_phd_syndrome: `p.map(f, x)` also falls under "working with NumPy arrays on an element-by-element basis". For each element, you're constructing a wrapper object, constructing a pickle representing the wrapper object, sending the pickle to a worker process through inter-process communication, deserializing the pickle, squaring the resulting wrapper, constructing a new pickle for the square, sending that back to the master through more inter-process communication, deserializing, and appending the result to a list. That is very, very slow. – user2357112 Dec 07 '22 at 13:39
  • @user2357112 - How does this differ from the List case? In the List case, I actually see a speed-up as I increase n=1,2,5. Is there no good way to parallelize operations, which are NOT already part of the numpy library, on numpy arrays? I do realize that I can set `OMP_NUM_THREADS` to control the inbuilt numpy operations which utilize BLAS on the backend, but I want to do non-numpy operations on numpy arrays. – irritable_phd_syndrome Dec 07 '22 at 13:49
  • The most basic solution is to chunk the array to eliminate 99% of your serialization overhead (e.g. make two map calls instead of 50000000). A lot of multiprocessing lives and dies on implementation details and it is necessary to do this correctly to have it work well. – CJR Dec 07 '22 at 13:53
  • @CJR - I don't think it is making 50000000 calls to `p.map`. I'm calling `with Pool(n) as p` where `n=Nthr=2`. – irritable_phd_syndrome Dec 07 '22 at 13:55
  • @irritable_phd_syndrome: The list version already stores Python object pointers instead of a buffer full of machine scalars, so it doesn't have to create wrapper objects on the fly. Serializing and deserializing plain Python numbers is also much faster than serializing and deserializing NumPy scalars, and the resulting pickles are shorter, too. (Despite what it might look like, the objects you receive when you extract elements from a NumPy array are not plain Python numbers; see the snippet after these comments.) – user2357112 Dec 07 '22 at 13:59
  • @user2357112 - It seems like your comments put together would consist of an answer. If possible, it would be very awesome if it included advice on 'properly' using `numpy` arrays with the `multiprocessing` library. – irritable_phd_syndrome Dec 07 '22 at 14:00
  • Note that by initializing `y` as a NumPy array, you've thrown off your `n=1` timings for the list version - only the `n=1` version goes through the overhead of writing results one by one back to a NumPy array. The `y = p.map(f, x)` code path throws away the initial `y` array. – user2357112 Dec 07 '22 at 14:02
  • You're making a map call, which includes serialization, processing, and serialization of the result, for every element in x. That is horrifyingly inefficient. – CJR Dec 07 '22 at 14:42
  • I'm using the `p.map` call in the exact same way that the first example on https://docs.python.org/3/library/multiprocessing.html does. They use it to square the elements of a 3 element list. Extending it to a non-trivial example would be simply making the list longer (which is what I'm doing). – irritable_phd_syndrome Dec 07 '22 at 15:03
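
To make the pickle-size point from the comments above concrete, here is a minimal check (the sizes in the comments are approximate and depend on the pickle protocol):

import pickle
import numpy as np

x = np.zeros(10)
print(type(x[5]))               # numpy.float64, not a plain Python float
print(len(pickle.dumps(5.0)))   # a plain float pickles to ~20 bytes
print(len(pickle.dumps(x[5])))  # the NumPy scalar pickles to several times that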

1 Answer


There's so much going on in the question it's hard to know where to start! The main issues are:

  1. Your function f isn't "doing" anything computationally intensive
  2. When using multiprocessing.Pool.map(fn, objs), objs is iterated over, so every element makes its own round trip through pickle

Put these together and you're basically just benchmarking the pickle module, which in your case amounts to benchmarking pickle.loads(pickle.dumps(list(y))), and that is what produces the terrible performance you see.

To explain in a bit more detail, I'll write some code:

from time import perf_counter
from multiprocessing import Pool
import pickle

import numpy as np

class CtxTimer:
    "Context manager to time execution of code and save runtime for later"
    def __init__(self, message):
        self.message = message

    def __enter__(self):
        self.start_time = perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.end_time = perf_counter()
        print(
            f"{self.message:15} "
            f"time={self.wall*1000:.2f}ms"
        )

    @property
    def wall(self):
        return self.end_time - self.start_time

def f(x):
    return x**2

# bit shorter than your example, but don't want to wait that long
objs = list(range(10**6))
# direct conversion to int64 numpy array (your code was using floats)
objs_np = np.array(objs)

with CtxTimer("Naive Python") as py_naive:
    list(map(f, objs))

with Pool(4) as pool:
    with CtxTimer("MP Pool Python") as py_pool:
        pool.map(f, objs)

    with CtxTimer("MP Pool Numpy") as np_pool:
        pool.map(f, objs_np)

with CtxTimer("Pickle Python") as py_dumps:
    buf = pickle.dumps(objs)

print(f"  {len(buf)/1e6 = :.2f}")

with CtxTimer("Unpickle Python") as py_loads:
    pickle.loads(buf)

with CtxTimer("Pickle Numpy") as np_dumps:
    buf = pickle.dumps(list(objs_np))

print(f"  {len(buf)/1e6 = :.2f}")

with CtxTimer("Unpickle Numpy") as np_loads:
    pickle.loads(buf)

On my deliberately underclocked laptop and limiting to 4 subprocesses, this gives:

  1. The Naive Python block takes ~550ms. So in 1 second, we can evaluate roughly 2 million function calls, with each just squaring an integer.

  2. Pickle Python takes ~30ms, and Unpickle Python takes ~60ms. These operations are going to be happening every time you use the Pool to transfer objects between processes. Note that this produces a ~5MB object.

  3. Pickle Numpy takes ~3700ms and Unpickle Numpy takes ~450ms. The difference is because pickle special-cases a few common datatypes, like list and int, which we happen to be benchmarking. Because this is a list of scalar NumPy values, you hit the slow path.

  4. MP Pool Python takes ~300ms and MP Pool Numpy takes ~4300ms.

To explain these numbers, you need to think about what is happening behind the scenes in the pool. First, there's the parent process that spawns the four child processes. It has to send all the work to the children, which involves calling pickle.dumps on everything and shipping the pickles to the child processes. It then waits for the results to come back and decodes them using pickle.loads. This seems to be the limiting factor, especially so for the Numpy case. Each child process gets a quarter of the items, decodes them, processes them, then encodes the results to send back.
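
As a rough serial re-enactment of that round trip, reusing f and pickle from the listing above (the helper and its name are just illustrative):

def simulate_pool_round_trip(objs):
    # parent -> children: pickle all of the work items
    sent = pickle.dumps(list(objs))
    # children: decode the work, run f over it, encode the results
    returned = pickle.dumps([f(x) for x in pickle.loads(sent)])
    # children -> parent: decode the results
    return pickle.loads(returned)

Timing simulate_pool_round_trip(objs_np) against simulate_pool_round_trip(objs) should reproduce essentially the same Numpy-vs-list gap without any processes involved.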

Calculating this for the Numpy variant, I get:

max(np_dumps.wall + np_loads.wall, (np_dumps.wall + py_naive.wall + np_loads.wall) / 4)

which outputs ~4.2s. Note I'm using max because we have to wait for the slowest part of the system. This agrees pretty well with my observed ~4300ms, but the same calculation for the Python variant isn't so close.
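
For reference, the analogous expression for the Python variant would be:

max(py_dumps.wall + py_loads.wall, (py_dumps.wall + py_naive.wall + py_loads.wall) / 4)

which, plugging in the timings above, comes out around 160ms against the observed ~300ms, so raw pickling plus the squaring work doesn't fully account for the Python pool's overhead.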

Just to point out why Numpy vectorisation should be preferred, the following code:

with CtxTimer("Naive Numpy") as np_naive:
    f(objs_np)

runs in ~2ms, i.e. more than 200x faster than the naive Python implementation. It's worth getting this speedup before dealing with the awkwardness of Python-imposed process-based parallelism, which could give you a maximum 32x speedup doing pure-Python things, or 6400x when doing more optimal things.
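
If you really do need to run per-element work over a NumPy array in parallel (as asked in the comments), the usual fix is to ship a few large slices instead of millions of scalars, so each worker pays the pickle cost once per slice. A minimal sketch of that pattern (square_chunk is an illustrative helper, not from the thread):

from multiprocessing import Pool
import numpy as np

def square_chunk(chunk):
    # one pickle round trip per slice; the body could equally be a plain
    # Python loop over the slice for non-NumPy operations
    return chunk**2

if __name__ == "__main__":
    x = np.arange(5 * 10**7)
    with Pool(2) as p:
        # two pickled slices instead of 50,000,000 pickled scalars
        parts = p.map(square_chunk, np.array_split(x, 2))
    y = np.concatenate(parts)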

– Sam Mason