What method underlies numpy's matrix multiplication? My understanding is that it uses BLAS, but I get very different runtimes when I perform a matrix multiplication in a Jupyter notebook versus a direct call to blas::gemm in C++. For example, this code:
import numpy
import time
N = 1000
mat = numpy.random.rand(N, N)
start = time.perf_counter()
m = mat@mat
print(time.perf_counter()-start)
Gives a runtime of about 0.25 seconds on my laptop. This C++ code:
int n = 1000;  // same matrix size as the numpy example
float *A_ = new float[n*n], *C_ = new float[n*n];
for (long i = 0; i < n*n; i++)
    A_[i] = 1.1f;
const char tran = 'N'; float alpha = 1, beta = 0;
clock_t st = clock();
// Single-precision (float) GEMM in the Fortran-style calling convention:
// C_ = alpha * A_ * A_ + beta * C_, column-major, leading dimension n.
blas::gemm(&tran, &tran, &n, &n, &n, &alpha, A_, &n, A_, &n, &beta, C_, &n);
clock_t fn = clock();
cout << float(fn - st) / float(CLOCKS_PER_SEC) << endl;
delete[] A_; delete[] C_;
takes about 0.85 seconds to run. So I'm guessing numpy is using something other than blas::gemm. Does anyone know what it's using under the hood?
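In case it's relevant, numpy.show_config() reports which BLAS/LAPACK libraries a given numpy build links against (e.g. OpenBLAS, MKL, or the reference BLAS), and repeating the timing with a float32 matrix would match the single precision used in the C++ call. A rough sketch of what I mean, assuming the same N as above (the config output will depend on the particular numpy build):
import numpy
import time

# Print which BLAS/LAPACK implementation this numpy build links against.
numpy.show_config()

N = 1000
# random.rand gives float64; cast to float32 to match the float arrays in the C++ code.
mat32 = numpy.random.rand(N, N).astype(numpy.float32)

start = time.perf_counter()
m = mat32 @ mat32
print(time.perf_counter() - start)  # wall-clock time for the float32 product
Thanks in advance.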