
I was comparing parallel matrix multiplication with numba and matrix multiplication with numpy when I noticed that numpy isn't as fast with integers (int32).

import numpy as np
from numba import njit, prange

@njit()
def matrix_multiplication(A, B):
  m, n = A.shape
  _, p = B.shape
  C = np.zeros((m, p))  # np.zeros defaults to float64, even when A and B are integer arrays
  for i in range(m):
    for j in range(n):
      for k in range(p):
        C[i, k] += A[i, j] * B[j, k]
  return C

@njit(parallel=True, fastmath=True)
def matrix_multiplication_parallel(A, B):
  m, n = A.shape
  _, p = B.shape
  C = np.zeros((m, p))
  for i in prange(m):  # parallelize the outer loop over rows of A
    for j in range(n):
      for k in range(p):
        C[i, k] += A[i, j] * B[j, k]
  return C

m = 100
n = 1000
p = 1500
A = np.random.randn(m, n)
B = np.random.randn(n, p)
A2 = np.random.randint(1, 100, size=(m, n))  # default int dtype is the platform C long: int32 on Windows, int64 on Linux
B2 = np.random.randint(1, 100, size=(n, p))
A3 = np.ones((m, n))
B3 = np.ones((n, p))

# compile the functions (the first call triggers JIT compilation)
matrix_multiplication(A, B)
matrix_multiplication_parallel(A, B)
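
Note that this only compiles the float64 specializations; numba compiles a separate specialization per argument-type signature, so the integer inputs still compile on their first call. A safer warm-up (a minimal sketch using the functions above) touches every function/dtype combination before timing:

for f in (matrix_multiplication, matrix_multiplication_parallel):
  f(A, B)    # compiles the float64 specialization (also covers A3, B3)
  f(A2, B2)  # compiles the integer specialization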

print('normal')
%timeit matrix_multiplication(A, B)
%timeit matrix_multiplication(A2, B2)
%timeit matrix_multiplication(A3, B3)
print('parallel')
%timeit matrix_multiplication_parallel(A, B)
%timeit matrix_multiplication_parallel(A2, B2)
%timeit matrix_multiplication_parallel(A3, B3)
print('numpy')
%timeit A @ B
%timeit A2 @ B2
%timeit A3 @ B3
normal
1.51 s ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.56 s* ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.5 s ± 34.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
333 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
408 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
313 ms ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
31.2 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*1.99 s* ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
28.4 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I found this answer explaining that numpy doesn't use BLAS for integers.
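
Since the integer path doesn't go through BLAS, one common workaround (a sketch; exact only while every intermediate value fits in float64, i.e. stays below 2**53) is to cast to float, multiply through BLAS, and cast back:

def int_matmul_via_float(A, B):
  # route the integer product through the BLAS float path; exact as long
  # as products and row sums stay below 2**53 and fit the integer dtype
  return (A.astype(np.float64) @ B.astype(np.float64)).astype(A.dtype)

For the arrays above (values in [1, 100) with n = 1000), every entry stays below 10**7, so the round trip is exact.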

From what I understand, both numpy and numba make use of vectorization. I wonder what could be different in the two implementations to give numba a relatively consistent ~25% advantage over numpy in the integer case (1.56 s vs 1.99 s).


I tried reversing the order of operations in case fewer CPU resources were available towards the end. I made sure not to run anything else while the benchmarks were running.

numpy
35.1 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*1.97 s* ± 44.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
32 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
normal
1.48 s ± 33.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.46 s* ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.47 s ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
379 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
461 ms ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
381 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Trying the method in the answer doesn't really help.

import inspect
inspect.getmodule(matrix_multiplication)
<module '__main__'>

I tried it on Google Colab.

normal
2.28 s ± 407 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
*1.7 s* ± 277 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.6 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
1.33 s ± 315 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.66 s ± 425 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.34 s ± 327 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
64.9 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
*2.14 s* ± 477 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
64.1 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It is possible to print the generated code, but I don't know how it can be compared to the numpy code.

for sig, llvm_ir in matrix_multiplication.inspect_llvm().items():
  print(sig, llvm_ir)
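
One way to check whether the numba loops were actually vectorized is to search the generated assembly, rather than the LLVM IR, for packed SIMD instructions. A sketch, assuming an x86-64 machine:

for sig, asm in matrix_multiplication.inspect_asm().items():
  # packed-double multiply/FMA mnemonics indicate SIMD vectorization
  print(sig, any(op in asm for op in ("mulpd", "vmulpd", "vfmadd")))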

Going to the definition of np.matmul leads to matmul: _GUFunc_Nin2_Nout1[L['matmul'], L[19], None] in ".../site-packages/numpy/__init__.pyi". I think this points to the C implementation that gets called in the "no BLAS" case. The code seems equivalent to mine, except for additional if statements.
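
To check which BLAS library the numpy build is linked against (and hence what the float path dispatches to), np.show_config() prints the build information:

import numpy as np
np.show_config()  # lists the BLAS/LAPACK libraries this numpy build uses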

For small arrays m = n = p = 10, numpy is faster.

normal
6.6 µs ± 99.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.72 µs ± 68.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
6.57 µs ± 62.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
parallel
63.5 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
64.5 µs ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
63.3 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
numpy
1.94 µs ± 56.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
2.53 µs ± 305 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
1.91 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

With n = 10000 instead of 1000:

normal
14.4 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14.3 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
14.7 s ± 538 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
parallel
3.34 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.42 s ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.46 s ± 78.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
334 ms ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
19.4 s ± 655 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
248 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Edit: Reported the issue

BPDev
  • I think the `numpy` variation can be explained by the use, or not, of BLAS functions. In my quick tests, `float32` is fastest, then `float64`. All `int` sizes are the same, 10-20 times slower, suggesting good numpy, but not BLAS. `object` is much slower. – hpaulj Feb 20 '23 at 16:54
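
A quick sketch to reproduce hpaulj's dtype comparison (absolute timings depend on the machine and the BLAS build; object arrays are omitted because they are orders of magnitude slower):

import timeit
import numpy as np

for dtype in (np.float32, np.float64, np.int32, np.int64):
  X = np.ones((500, 500), dtype=dtype)
  t = timeit.timeit(lambda: X @ X, number=10) / 10
  print(f"{np.dtype(dtype).name:>8}: {t * 1e3:.1f} ms")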

1 Answer


You are comparing two different loop patterns. The pattern equivalent to the Numpy implementation looks like the following.

@njit()
def matrix_multiplication_slow(A, B):
  m, n = A.shape
  _, p = B.shape
  C = np.zeros((m, p))
  for i in range(m):
    for k in range(p):
      for j in range(n):  # innermost loop strides down a column of B
        C[i, k] += A[i, j] * B[j, k]
  return C

The above matrix_multiplication_slow() is slower than the original matrix_multiplication() because reading the B[j, k] values while iterating over j in the innermost loop causes many more cache misses: successive iterations jump a full row length through B instead of walking along a contiguous row.

So the current Numpy implementation is not cache friendly. It would be good to report this here.
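
To see the cache effect directly, the two loop orders can be timed against each other (a sketch reusing the question's arrays; absolute numbers will vary by machine):

import time

def bench(f, *args, repeats=3):
  f(*args)  # warm-up call so JIT compilation is not timed
  times = []
  for _ in range(repeats):
    t0 = time.perf_counter()
    f(*args)
    times.append(time.perf_counter() - t0)
  return min(times)

print('i-j-k (k innermost, row-wise B access):', bench(matrix_multiplication, A, B))
print('i-k-j (j innermost, column-wise B access):', bench(matrix_multiplication_slow, A, B))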

relent95
  • Strange, the original loop order is faster (216 ms ± 12.6 ms) than this loop order (366 ms ± 52.5 ms), so I would think it's the one that's more cache friendly. (numpy: 298 ms ± 39 ms per loop.) I wonder why they would use the less performant loop order. – BPDev Feb 20 '23 at 17:16
  • @BPDev, No, the Numpy loop order is more performant than your loop order on average over m, n, and p values. – relent95 Feb 21 '23 at 00:32
  • I can't seem to find values of m, n, and p for which this is true (except for small values < 30). Based on [this](https://stackoverflow.com/a/7395643/20898396) answer, we should iterate over the columns of B last ("for k in range(p)" being equivalent to "for (int j = 0; j < n; j++)"). [tests](https://colab.research.google.com/drive/1IeQAckH5n7E57-cSTCoLJFOAmx0JDjfC?usp=sharing) in Google Colab for consistency. – BPDev Feb 21 '23 at 18:05
  • @BPDev, with ```m, n, p = 1000, 1500, 100```, Numpy takes 0.13 s and Numba takes 0.17 s on my PC. But with ```m, n, p = 100, 1000, 1500```, Numpy takes 0.26 s and Numba takes 0.17 s. Try again. – relent95 Feb 22 '23 at 00:38
  • @BPDev, you are right. By comparing two Numba functions with the two different loop patterns, I confirmed that your original loop pattern performs better. I missed the cache misses. I'll update the answer for future readers. – relent95 Feb 22 '23 at 01:25