
I am using numpy.sum() for real-time image processing on a Raspberry Pi 4B (four-core ARM). The data is a 2-dimensional array of 8-bit unsigned integers (type uint8), of typical size 2048 x 2048. One of the important operations is to sum this along rows and columns:

vertical   = np.sum(data, axis=0) 
horizontal = np.sum(data, axis=1)

I have noticed that these operations saturate only one CPU core, leaving the other three cores idle. This is in contrast to multi-threaded numpy operations such as a = np.dot(data,data), which saturate all four CPU cores.

I have sped up my code by launching four separate execution threads. The four threads do the following operations:

thread 1:  vertical1   = np.sum(data[   0:1024,    :    ], axis=0)
thread 2:  vertical2   = np.sum(data[1024:2048,    :    ], axis=0)
thread 3:  horizontal1 = np.sum(data[    :    ,   0:1024], axis=1)
thread 4:  horizontal2 = np.sum(data[    :    ,1024:2048], axis=1)

After the threads complete, I sum the two vertical and two horizontal arrays, to get my desired result.
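
A minimal sketch of this approach using concurrent.futures (the exact threading code is not shown above; the partial_sum helper is just illustrative):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

data = np.random.randint(0, 256, (2048, 2048), dtype=np.uint8)  # stand-in frame

def partial_sum(block, axis):
    # np.sum releases the GIL while it runs, so these four calls
    # can execute concurrently on different cores.
    return np.sum(block, axis=axis)

with ThreadPoolExecutor(max_workers=4) as pool:
    v1 = pool.submit(partial_sum, data[:1024, :], 0)
    v2 = pool.submit(partial_sum, data[1024:, :], 0)
    h1 = pool.submit(partial_sum, data[:, :1024], 1)
    h2 = pool.submit(partial_sum, data[:, 1024:], 1)

vertical = v1.result() + v2.result()    # shape (2048,)
horizontal = h1.result() + h2.result()  # shape (2048,)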

Is there a way to configure or build numpy so that the type of multi-core parallelisation that I am describing can be done automatically by np.sum()? This is clearly happening in some of the linear algebra routines, and in fact I can get a small speedup by using np.dot() to dot vectors containing all ones into my frame matrix. However, although this does use multiple cores, it's much slower than my simple "four thread" approach described above.

(For those who believe that Python can only run single-threaded, I would like to quote from the Python documentation: "...if you are using numpy to do array operations then Python will release the global interpreter lock (GIL), meaning that if you write your code in a numpy style, much of the calculations will be done in a few array operations, providing you with a speedup by using multiple threads.")

B A
  • what makes you think that `np.dot` runs on *"all four CPU cores"*? – RomanPerekhrest Feb 22 '23 at 14:02
  • Hi Roman, if you run "top" on Linux or a Mac, you will see that the column labeled CPU goes well above 100% for the python process that is running numpy. On my 4-core RPi, it reaches close to 400%, and on an 8-core Mac it gets close to 800%. Please see here for details of such testing: https://github.com/raspberrypi/picamera2/discussions/578#discussioncomment-5061512 . Cheers, Bruce – B A Feb 22 '23 at 17:35
  • 1
    `dot/matmul` use `BLAS` functions where possible, and those can be complied to make optimal use of cores. At the other extreme, pure python code has a lock, GIL, the limits multithreaded (and core) operations. `numpy` compiled code can turn off that GIL where possible, but generally isn't compiled to handle multicore processing like BLAS. – hpaulj Feb 22 '23 at 17:57

1 Answer


Is there a way to configure or build numpy so that the type of multi-core parallelisation that I am describing can be done automatically by np.sum()?

AFAIK, no: Numpy just releases the GIL in some computing parts, so using multiple threads can increase performance, but it does not currently create threads itself (note that not all Numpy functions fully release the GIL, since some of them are implemented in pure Python internally).

One solution is to use an external package to do that for you. Numba should be able to do that (see the sketch below). Numexpr generally does that, but its sum is not currently implemented in parallel, and it is not efficient either.
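For instance, here is a minimal sketch of what a Numba-based parallel column sum could look like (assuming Numba is installed; the chunking scheme and the names sum_axis0/n_chunks are illustrative, not an established API):

import numba as nb
import numpy as np

@nb.njit(parallel=True, cache=True)
def sum_axis0(data, n_chunks=4):
    # Sum a 2D uint8 array along axis 0 with int32 accumulators,
    # without materializing a widened copy of the whole frame.
    n_rows, n_cols = data.shape
    chunk = (n_rows + n_chunks - 1) // n_chunks
    partial = np.zeros((n_chunks, n_cols), dtype=np.int32)
    for c in nb.prange(n_chunks):  # one row block per thread: no races
        for i in range(c * chunk, min((c + 1) * chunk, n_rows)):
            for j in range(n_cols):
                partial[c, j] += data[i, j]
    out = np.zeros(n_cols, dtype=np.int32)
    for c in range(n_chunks):  # cheap final reduction
        for j in range(n_cols):
            out[j] += partial[c, j]
    return out

data = np.random.randint(0, 256, (2048, 2048), dtype=np.uint8)
vertical = sum_axis0(data)  # matches np.sum(data, axis=0)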

This is clearly happening in some of the linear algebra routines

Yes. This is because Numpy calls functions of the default BLAS library available on the target platform. This BLAS implementation is generally OpenBLAS, which is highly optimized and uses multiple threads. However, only a few Numpy functions call BLAS ones.


In practice, parallelizing np.sum is not efficient. In fact, np.sum is inefficient even sequentially in this case, because it has to convert the np.uint8 array to np.int_ (32-bit or 64-bit depending on the target platform) so it can compute the sum without overflow. This conversion is slow because it creates a much bigger array, and RAM is slow: writing/reading an array that is 4 to 8 times bigger is very inefficient, especially in parallel. Indeed, RAM is a shared resource, and a few cores are generally enough to saturate it (assuming the code is efficient, e.g. uses SIMD instructions, which is not the case here yet), so operating on big arrays in parallel often does not scale. For more information, please read this related post.

I advise you to use a recent version of Numpy, since we recently optimized Numpy functions so that they make better use of SIMD instructions (although mainstream ARM platforms should not be impacted much). We do not plan to remove this conversion overhead any time soon.

On Linux, it is certainly better to specify dtype=np.int32 to the sum so as to get a faster computation (the default accumulator there is 64-bit, so this halves the temporary memory traffic). Still, the best solution is not to do the conversion at all: Numba and Cython can be used to do that easily. The resulting parallel code can not only be faster, but can also scale better and use far less memory.
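For example, with the 2048 x 2048 uint8 frame from the question (2048 * 255 is far below 2**31, so an int32 accumulator cannot overflow here):

vertical   = np.sum(data, axis=0, dtype=np.int32)
horizontal = np.sum(data, axis=1, dtype=np.int32)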

You can also use a smaller dtype=np.uint16 if you do the computation in chunks so as to avoid overflow, as sketched below. However, this is cumbersome.
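Here is a sketch of that chunked idea for the vertical sum (256 uint8 values sum to at most 256 * 255 = 65280, which still fits in a uint16; the 8-chunk split assumes the 2048-row frame from the question):

chunks = data.reshape(8, 256, 2048)             # split the 2048 rows into 8 chunks
partial = chunks.sum(axis=1, dtype=np.uint16)   # shape (8, 2048), no overflow possible
vertical = partial.sum(axis=0, dtype=np.int32)  # tiny final reduction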

Finally, note that OpenCV, which is specifically designed for fast image manipulation, probably supports such a basic feature. Most basic OpenCV functions are parallel and generally well optimized. In fact, this is the mainstream way to process images efficiently in many languages.
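For instance, cv2.reduce computes row/column reductions in optimized native code (this assumes the opencv-python package; whether it beats the hand-written threads is worth benchmarking on your own hardware):

import cv2
vertical   = cv2.reduce(data, 0, cv2.REDUCE_SUM, dtype=cv2.CV_32S).ravel()  # column sums
horizontal = cv2.reduce(data, 1, cv2.REDUCE_SUM, dtype=cv2.CV_32S).ravel()  # row sums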

Jérôme Richard
  • Curious, why not internally analyze and pick `arr.dtype` (if it's not `uint64`) in `np.sum(arr)` to avoid conversion? – RomanPerekhrest Feb 23 '23 at 10:05
  • @RomanPerekhrest AFAIK, this is because users generally expect no overflows to happen when they sum up arrays with a small dtype. For example, for a uint8 dtype, summing 100+100+100 and getting 44 is rather unexpected for most users. Numpy chooses to use the default integer size on the target architecture so as to avoid that and get safer results (I think `int64` on 64-bit machines would have been better, since the default integer size is unfortunately 32-bit on Windows). – Jérôme Richard Feb 23 '23 at 11:33
  • Dear Jérôme, thanks for the helpful reply. I'm currently using numpy 1.24.2. Is that recent enough? Yes, the type conversion is more costly than the addition. If needed I can write my own C code to do the conversion and summation and link that to Python. Indeed the fastest way to sum it (numpy) is to sum 256 elements into a uint16, and then to sum those 6 or 8 uint16 into a float32. It's a good point about opencv. I found cv2.reduce() and will investigate whether that is faster. Cheers, Bruce – B A Feb 24 '23 at 16:52
  • This version should be recent enough (the fix was merged Mar 7, 2022). OpenCV should use such a fast SIMD reduction method AFAIK. There are even some specific x86 instructions for that, though the code should quickly become memory-bound even with a not-very-optimized approach, unless the image completely fits in the L3 cache. – Jérôme Richard Feb 25 '23 at 01:06