I am using numpy.sum() for real-time image processing on a Raspberry Pi 4B (four-core ARM). The data is a two-dimensional array of 8-bit unsigned integers (dtype uint8), typically 2048 x 2048. One of the important operations is to sum it along rows and columns:
vertical = np.sum(data, axis=0)
horizontal = np.sum(data, axis=1)
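For completeness, here is a self-contained version of these two reductions (the frame here is random data standing in for a real camera frame). NumPy promotes a uint8 input to the platform's default integer accumulator inside np.sum, so summing 2048 values per row or column does not overflow, but passing an explicit dtype pins that behaviour down, which can matter on 32-bit ARM builds:

```python
import numpy as np

# Hypothetical frame; the real data would come from the camera.
data = np.random.randint(0, 256, (2048, 2048), dtype=np.uint8)

# Explicit accumulator dtype: max possible sum is 2048 * 255 = 522240,
# which fits comfortably in uint32.
vertical = np.sum(data, axis=0, dtype=np.uint32)    # column sums, shape (2048,)
horizontal = np.sum(data, axis=1, dtype=np.uint32)  # row sums, shape (2048,)
```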
I have noticed that these operations saturate only one CPU core, leaving the other three idle. This is in contrast to multi-threaded numpy operations such as a = np.dot(data, data), which saturate all four CPU cores.
I have sped up my code by launching four separate threads, each performing one of the following operations:
thread 1: vertical1 = np.sum(data[ 0:1024, : ], axis=0)
thread 2: vertical2 = np.sum(data[1024:2048, : ], axis=0)
thread 3: horizontal1 = np.sum(data[ : , 0:1024], axis=1)
thread 4: horizontal2 = np.sum(data[ : ,1024:2048], axis=1)
After the threads complete, I add the two vertical partial results element-wise, and likewise the two horizontal ones, to get my desired result.
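The approach above can be sketched as follows (a minimal version using the standard threading module; because np.sum releases the GIL, the four reductions can genuinely run concurrently on separate cores):

```python
import threading
import numpy as np

def threaded_sums(data):
    """Split the column-sum and row-sum reductions across four threads."""
    results = {}

    def work(key, arr, axis):
        # np.sum releases the GIL while it runs, so these calls overlap.
        results[key] = np.sum(arr, axis=axis)

    n = data.shape[0] // 2
    m = data.shape[1] // 2
    threads = [
        threading.Thread(target=work, args=("v1", data[:n, :], 0)),
        threading.Thread(target=work, args=("v2", data[n:, :], 0)),
        threading.Thread(target=work, args=("h1", data[:, :m], 1)),
        threading.Thread(target=work, args=("h2", data[:, m:], 1)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Combine the partial reductions element-wise.
    vertical = results["v1"] + results["v2"]
    horizontal = results["h1"] + results["h2"]
    return vertical, horizontal
```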
Is there a way to configure or build numpy so that the type of multi-core parallelisation that I am describing can be done automatically by np.sum()? This is clearly happening in some of the linear algebra routines, and in fact I can get a small speedup by using np.dot() to dot all-ones vectors into my frame matrix. However, although this does use multiple cores, it's much slower than my simple four-thread approach described above.
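For reference, this is what I mean by the all-ones dot-product trick (a sketch; it assumes NumPy is linked against a multi-threaded BLAS such as OpenBLAS, and it casts the frame to float32 since the threaded BLAS path only applies to floating-point matrices):

```python
import numpy as np

data = np.random.randint(0, 256, (2048, 2048), dtype=np.uint8)

# Cast to float32 so np.dot dispatches to the (multi-threaded) BLAS.
# All intermediate sums are below 2**24, so float32 addition is exact here.
f = data.astype(np.float32)
ones = np.ones(2048, dtype=np.float32)

vertical = ones @ f    # column sums, shape (2048,)
horizontal = f @ ones  # row sums, shape (2048,)
```

The extra cost comes from the uint8-to-float32 conversion and from the 2048 multiplications by 1.0 per output element, which is why it loses to the plain four-thread np.sum approach despite using all cores.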
(For those who believe that Python can only run single-threaded, I would like to quote from the Python documentation: "...if you are using numpy to do array operations then Python will release the global interpreter lock (GIL), meaning that if you write your code in a numpy style, much of the calculations will be done in a few array operations, providing you with a speedup by using multiple threads.")