import numpy as np
import time
C = np.random.rand(4, 100000)
# TEST A
times_A = []
for _ in range(10):
    t0 = time.perf_counter()
    A = np.random.rand(100, 3, 4)  # stack of 100 small (3, 4) matrices
    X = A.dot(C)                   # one call; result has shape (100, 3, 100000)
    times_A.append(time.perf_counter() - t0)
# TEST B
times_B = []
for _ in range(10):
    t0 = time.perf_counter()
    for _ in range(100):
        B = np.random.rand(3, 4)
        X = B.dot(C)  # 100 separate calls; each result has shape (3, 100000)
    times_B.append(time.perf_counter() - t0)
print('TIME A: ', np.mean(times_A))
print('TIME B: ', np.mean(times_B))
OUTPUT:
TIME A: 1.002193902921863
TIME B: 0.0581539266044274
Referring to the example above, why is Test A slower than Test B by a factor of 10-20? To my understanding, the number of FLOPs should be exactly the same in both tests. I expected B to be slightly slower, due to the additional overhead of calling the Python-C API multiple times. I ran the test multiple times, also swapping the order of Test A and Test B, e.g. to avoid lazy-evaluation effects from initializing matrix C.
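
For completeness, here is a quick sanity check (separate from the timed runs, reusing the A and C shapes from above) that both tests really do the same arithmetic. If I am not mistaken about np.dot's N-D semantics, each slice A.dot(C)[i] should equal A[i].dot(C), and the nominal FLOP count (one multiply and one add per inner-product term) should be identical:

import numpy as np

C = np.random.rand(4, 100000)
A = np.random.rand(100, 3, 4)

X = A.dot(C)                        # one batched call
assert X.shape == (100, 3, 100000)  # per np.dot's N-D semantics

# Slice i of the batched result matches the corresponding Test B product.
for i in range(100):
    assert np.allclose(X[i], A[i].dot(C))

# Same nominal FLOP count for both variants:
# 100 small (3, 4) x (4, 100000) products, 2 ops per inner-product term.
flops = 100 * 2 * 3 * 4 * 100000
print('FLOPs per run:', flops)      # 2.4e8 in both tests

Both assertions hold, so the slowdown cannot be explained by the two tests computing different results.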