Why is matrix multiplication with NumPy so much faster than gsl_blas_sgemm from GSL? For instance, this Python code:
import numpy as np
import time

N = 1000
M = np.zeros(shape=(N, N), dtype=np.float64)  # np.float was removed in recent NumPy; float64 is what it aliased
for i in range(N):
    for j in range(N):
        M[i, j] = 0.23 + 100 * i + j

tic = time.time()
np.matmul(M, M)
toc = time.time()
print(toc - tic)
gives something between 0.017 and 0.019 seconds, while the equivalent C++ code:
#include <chrono>
#include <iostream>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>

using namespace std::chrono;

int main(void) {
    int N = 1000;
    gsl_matrix_float* M = gsl_matrix_float_alloc(N, N);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            gsl_matrix_float_set(M, i, j, 0.23 + 100 * i + j);
        }
    }
    gsl_matrix_float* C = gsl_matrix_float_alloc(N, N); // save the result into C

    auto start = high_resolution_clock::now();
    gsl_blas_sgemm(CblasNoTrans, CblasNoTrans, 1.0, M, M, 0.0, C);
    auto stop = high_resolution_clock::now();

    auto duration = duration_cast<milliseconds>(stop - start);
    std::cout << duration.count() << std::endl;
    return 0;
}
I get a runtime for the multiplication of about 2.7 seconds. I am compiling with the maximum speed option /O2 in Visual Studio. I must be doing something very wrong. I was not expecting much better performance from the C++ code, since I am aware that NumPy is essentially optimized C code, but I was also not expecting it to be about 150 times slower than Python. Why is that? How can I improve the runtime of the multiplication relative to NumPy?
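For reference, GSL's manual says its BLAS interface can be swapped at link time for any optimized CBLAS implementation, so one thing I plan to try is calling cblas_sgemm from a library such as OpenBLAS (or Intel MKL) directly. Here is a minimal sketch of that idea; I have not benchmarked it yet, and the exact header name and link flags (e.g. -lopenblas) depend on the installation:

#include <chrono>
#include <iostream>
#include <vector>
#include <cblas.h>  // provided by OpenBLAS/MKL; exact header depends on the installation

using namespace std::chrono;

int main() {
    const int N = 1000;
    // Plain row-major buffers instead of gsl_matrix_float
    std::vector<float> M(N * N), C(N * N);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            M[i * N + j] = 0.23f + 100.0f * i + j;

    auto start = high_resolution_clock::now();
    // C = 1.0 * M * M + 0.0 * C, row-major, no transposition
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0f, M.data(), N, M.data(), N, 0.0f, C.data(), N);
    auto stop = high_resolution_clock::now();

    std::cout << duration_cast<milliseconds>(stop - start).count() << " ms" << std::endl;
    return 0;
}

If my understanding is correct, NumPy's matmul dispatches to exactly this kind of optimized BLAS (OpenBLAS or MKL, depending on the build), while GSL's bundled gslcblas is an unoptimized reference implementation, which would explain most of the gap.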
Background of the problem: I need to evaluate a 1000- to 2000-dimensional integral, which I am doing with the Monte Carlo method. For that I wrote almost the whole integrand as NumPy array operations; this is already quite fast, but I need it even faster because I have to evaluate the same integrand 100,000 to 500,000 times, so any small improvement would help. Does it make sense to rewrite the code in C/C++, or should I stick with NumPy? Thanks!