import numpy as np
import time
C = np.random.rand(4, 100000)
# TEST A
times_A = []
for _ in range(10):
    t0 = time.perf_counter()
    A = np.random.rand(100, 3, 4)  # stack of 100 small (3, 4) matrices
    X = A.dot(C)                   # one call; result has shape (100, 3, 100000)
    times_A.append(time.perf_counter() - t0)
# TEST B
times_B = []
for _ in range(10):
    t0 = time.perf_counter()
    for _ in range(100):
        B = np.random.rand(3, 4)
        X = B.dot(C)  # 100 separate calls; each result has shape (3, 100000)
    times_B.append(time.perf_counter() - t0)
print('TIME A: ', np.mean(times_A))
print('TIME B: ', np.mean(times_B))
OUTPUT:
TIME A: 1.002193902921863
TIME B: 0.0581539266044274
Referring to the example above, why is Test A slower than Test B by a factor of 10-20? To my understanding, the number of FLOPs should be exactly the same in both tests. I expected B to be slightly slower, due to the additional overhead of calling the Python-C API multiple times. I ran the test multiple times, also swapping the order of Test A and Test B, e.g. to avoid lazy-evaluation effects from initializing matrix C.
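
For completeness, here is a quick sanity check (separate from the timed runs, reusing the A and C shapes from above) that both tests really do the same arithmetic. If I am not mistaken about np.dot's N-D semantics, each slice A.dot(C)[i] should equal A[i].dot(C), and the nominal FLOP count (one multiply and one add per inner-product term) should be identical:

import numpy as np

C = np.random.rand(4, 100000)
A = np.random.rand(100, 3, 4)

X = A.dot(C)                        # one batched call
assert X.shape == (100, 3, 100000)  # per np.dot's N-D semantics

# Slice i of the batched result matches the corresponding Test B product.
for i in range(100):
    assert np.allclose(X[i], A[i].dot(C))

# Same nominal FLOP count for both variants:
# 100 small (3, 4) x (4, 100000) products, 2 ops per inner-product term.
flops = 100 * 2 * 3 * 4 * 100000
print('FLOPs per run:', flops)      # 2.4e8 in both tests

Both assertions hold, so the slowdown cannot be explained by the two tests computing different results.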