
I am trying to find the fastest way to sum 2 matrices of the same size using Numba. I came up with 3 different approaches but none of them could beat Numpy. Here is my code:

import numpy as np
from numba import njit,vectorize, prange,float64
import timeit
import time

# function 1: 
def sum_numpy(A,B):
    return A+B

# function 2: 
sum_numba_simple = njit(cache=True, fastmath=True)(sum_numpy)

# function 3: 
@vectorize([float64(float64, float64)])
def sum_numba_vectorized(A,B):
    return A+B

# function 4: 
@njit('(float64[:,:],float64[:,:])', cache=True, fastmath=True, parallel=True)
def sum_numba_loop(A,B):
    n=A.shape[0]
    m=A.shape[1]
    C = np.empty((n, m), A.dtype)

    for i in prange(n):
        for j in prange(m):
            C[i,j]=A[i,j]+B[i,j]
  
    return C

#Test the functions with 2 matrices of size 1,000,000x3:
N=1000000
np.random.seed(123)
A=np.random.uniform(low=-10, high=10, size=(N,3))
B=np.random.uniform(low=-5, high=5, size=(N,3)) 

t1=min(timeit.repeat(stmt='sum_numpy(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t2=min(timeit.repeat(stmt='sum_numba_simple(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t3=min(timeit.repeat(stmt='sum_numba_vectorized(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))
t4=min(timeit.repeat(stmt='sum_numba_loop(A,B)',timer=time.perf_counter,repeat=3, number=100,globals=globals()))

print("function 1 (sum_numpy): t1= ",t1,"\n")
print("function 2 (sum_numba_simple): t2= ",t2,"\n")
print("function 3 (sum_numba_vectorized): t3= ",t3,"\n")
print("function 4 (sum_numba_loop): t4= ",t4,"\n")

Here are the results:

function 1 (sum_numpy): t1= 0.1655790419999903

function 2 (sum_numba_simple): t2= 0.3019776669998464

function 3 (sum_numba_vectorized): t3= 0.16486266700030683

function 4 (sum_numba_loop): t4= 0.1862256660001549

As you can see, the results show that there isn't any advantage in using Numba in this case. Therefore, my question is:
Is there any other implementation that would increase the speed of the summation?

    Numpy's addition is going to be as fast as it gets, at least on one processor. Of course it would be better to avoid allocations with `np.add(A, B, out=C)` if possible. – August Oct 02 '22 at 05:32
    Unroll the inner loop in *sum_numba_loop* `C[i,0] = A[i,0] + B[i,0]; C[i,1] = A[i,1] + B[i,1]; C[i,2] = A[i,2] + B[i,2]`, hint the data layout in the signature `@njit('float64[:,::1](float64[:,::1],float64[:,::1])', parallel=True)` is ~2x faster in my benchmarks on 4 cores of Intel Xeon Platinum 8259CL: 0.61 s (numpy) vs 0.28 s (numba). Your implementation fails to compilte with SIMD instructions. I can't tell if this is applicable for your hardware. – Michael Szczesny Oct 02 '22 at 11:16
    How about using [CuPy](https://docs.cupy.dev/en/stable/overview.html)? – alex Oct 02 '22 at 12:47
  • @MichaelSzczesny Thank you for the suggestion. It didn't produce any improvement on my Mac M1 chip however. – Rafid Bendimerad Oct 02 '22 at 16:04
  • Same on my machine: unrolling only helps when the computation is compute-bound, which depends on the machine. As for CuPy, it is only worth it if the data are already on the GPU. Otherwise, the data transfer will make it slower on nearly all machines. – Jérôme Richard Oct 02 '22 at 19:50

1 Answer


Your code is bound by page faults (see here, here and there for more information about this). Page faults happen because the array is newly allocated. A solution is to preallocate it and then write into it, so that pages do not have to be remapped in physical memory. `np.add(A, B, out=C)` does this, as indicated by @August in the comments. Another solution could be to tune the standard allocator so that it does not give memory back to the OS, at the expense of a significant memory usage overhead (AFAIK TC-Malloc can do that, for example).
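
For example, here is a minimal sketch of that preallocation approach, reusing the setup from the question (the variable name C and the extra timing call are only for illustration):

import numpy as np
import timeit, time

N = 1000000
np.random.seed(123)
A = np.random.uniform(low=-10, high=10, size=(N, 3))
B = np.random.uniform(low=-5, high=5, size=(N, 3))

# Preallocate the output once: subsequent calls write into pages that are
# already mapped, so no page faults are paid inside the timed region.
C = np.empty_like(A)

t_out = min(timeit.repeat(stmt='np.add(A, B, out=C)', timer=time.perf_counter,
                          repeat=3, number=100, globals=globals()))
print("np.add(A, B, out=C): ", t_out)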

There is another issue on most platforms (especially x86 ones): the cache-line write allocations of write-back caches are expensive during writes. The typical solution to avoid this is to use non-temporal stores (if available on the target processor, which is the case on x86-64 but maybe not on others). That being said, neither Numpy nor Numba are able to do that yet. For Numba, I filed an issue covering a simple use-case. Compilers themselves (GCC for Numpy and Clang for Numba) tend not to generate such instructions because they can be detrimental to performance when arrays fit in cache, and compilers do not know the size of the arrays at compile time (they could generate specific code when they can evaluate the amount of data computed, but this is not easy and can slow down other codes). AFAIK, the only way to fix this is to write C code and use low-level intrinsics, or to use compiler directives. In your case, about 25% of the bandwidth is lost due to this effect, causing a slowdown of up to 33%.

Using multiple threads does not always make memory-bound code faster. In fact, it generally barely scales because using more cores does not speed up the execution when the RAM is already saturated. Only a few cores are generally needed to saturate the RAM on most platforms. Page faults can benefit from multiple cores depending on the target system (Linux handles them in parallel quite well, Windows generally does not scale well, IDK for MacOS).
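
If you want to check how the parallel version scales on your own machine, here is a rough sketch using numba.set_num_threads (available since Numba 0.49); it reuses sum_numba_loop, A and B from the question:

import numba
import timeit, time

# Re-run the benchmark with a varying number of worker threads; on a
# memory-bound kernel the timings typically stop improving after a few cores.
for n_threads in range(1, numba.config.NUMBA_NUM_THREADS + 1):
    numba.set_num_threads(n_threads)
    t = min(timeit.repeat(stmt='sum_numba_loop(A, B)', timer=time.perf_counter,
                          repeat=3, number=100, globals=globals()))
    print(n_threads, "threads:", t)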

Finally, there is another issue: the code is not vectorized (at least not on my machine, while it could be). One solution is to flatten the array view and do one big loop that the compiler can more easily vectorize (the j-based loop is too small for SIMD instructions to be effective). The contiguity of the input arrays should also be specified in the signature so the compiler can generate fast SIMD code. Here is the resulting Numba code:

@njit('(float64[:,::1], float64[:,::1], float64[:,::1])', cache=True, fastmath=True, parallel=True)
def sum_numba_fast_loop(A, B, C):
    n, m = A.shape
    assert C.shape == A.shape
    # Flat views over the contiguous 2D arrays: one big loop is easier to vectorize
    A_flat = A.reshape(n*m)
    B_flat = B.reshape(n*m)
    C_flat = C.reshape(n*m)
    for i in prange(n*m):
        C_flat[i] = A_flat[i] + B_flat[i]
    return C
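
For completeness, here is a small usage sketch, with the output preallocated once (as discussed above) and timed like the other variants from the question:

C = np.empty_like(A)  # preallocated once, outside the timed region

t5 = min(timeit.repeat(stmt='sum_numba_fast_loop(A, B, C)', timer=time.perf_counter,
                       repeat=3, number=100, globals=globals()))
print("function 5 (sum_numba_fast_loop): t5= ", t5)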

Here are the results on my 6-core i5-9600KF processor with ~42 GiB/s RAM:

sum_numpy:                       0.642 s    13.9 GiB/s
sum_numba_simple:                0.851 s    10.5 GiB/s
sum_numba_vectorized:            0.639 s    14.0 GiB/s
sum_numba_loop serial:           0.759 s    11.8 GiB/s
sum_numba_loop parallel:         0.472 s    18.9 GiB/s
Numpy "np.add(A, B, out=C)":     0.281 s    31.8 GiB/s  <----
Numba fast:                      0.288 s    31.0 GiB/s  <----
Optimal time:                    0.209 s    32.0 GiB/s

The Numba code and the Numpy one saturate my RAM. Using more cores does not help (in fact it is a bit slower, certainly due to contention on the memory controller). Both are sub-optimal since they do not use non-temporal store instructions, which would prevent cache-line write allocations (these cause data to be fetched from the RAM before being written back). The optimal time is the one expected when using such instructions. Note that only 65-80% of the theoretical RAM bandwidth is expected to be reached because of mixed RAM reads/writes. Indeed, interleaving reads and writes causes low-level overheads that prevent the RAM from being saturated. For more information about how RAM works, please consider reading Introduction to High Performance Scientific Computing -- Chapter 1.3 and What Every Programmer Should Know About Memory (and possibly this).
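
For reference, here is a rough sketch of how the bandwidth figures above can be reconstructed from the timings, assuming the benchmark from the question (100 calls per measurement on 1,000,000x3 float64 arrays) and counting the extra cache-line write-allocate read of C where it applies:

bytes_per_array = 1_000_000 * 3 * 8        # one float64 array of the question
calls = 100                                # number=100 in timeit.repeat

# Read A, read B, write C, plus one write-allocate read of C on a classic
# write-back cache (this is the traffic that non-temporal stores would remove).
traffic_with_allocate = calls * 4 * bytes_per_array   # ~9.6 GB
traffic_optimal = calls * 3 * bytes_per_array          # ~7.2 GB

print(traffic_with_allocate / 0.281 / 2**30)   # ~31.8 GiB/s -> np.add(A, B, out=C)
print(traffic_with_allocate / 0.288 / 2**30)   # ~31.0 GiB/s -> Numba fast
print(traffic_optimal / 0.209 / 2**30)         # ~32.0 GiB/s -> optimal time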

Jérôme Richard