Why is matrix multiplication with NumPy so much faster than gsl_blas_sgemm from GSL? For instance, this Python code:
import numpy as np
import time

N = 1000
M = np.zeros(shape=(N, N), dtype=np.float64)  # np.float was removed in recent NumPy; float64 is what it aliased
for i in range(N):
    for j in range(N):
        M[i, j] = 0.23 + 100 * i + j

tic = time.time()
np.matmul(M, M)
toc = time.time()
print(toc - tic)
gives something between 0.017 and 0.019 seconds, while the equivalent C++ code:
#include <chrono>
#include <iostream>
#include <gsl/gsl_matrix.h>
#include <gsl/gsl_blas.h>

using namespace std::chrono;

int main(void) {
    int N = 1000;
    gsl_matrix_float* M = gsl_matrix_float_alloc(N, N);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            gsl_matrix_float_set(M, i, j, 0.23 + 100 * i + j);
        }
    }
    gsl_matrix_float* C = gsl_matrix_float_alloc(N, N); // save the result into C

    auto start = high_resolution_clock::now();
    gsl_blas_sgemm(CblasNoTrans, CblasNoTrans, 1.0, M, M, 0.0, C);
    auto stop = high_resolution_clock::now();

    auto duration = duration_cast<milliseconds>(stop - start);
    std::cout << duration.count() << std::endl;
    return 0;
}
I get a runtime for the multiplication of about 2.7 seconds. I am compiling with the maximum speed option /O2 in Visual Studio. I must be doing something very wrong. I was not expecting much better performance from the C++ code, since I am aware that NumPy is essentially optimized C code, but I was also not expecting it to be about 150 times slower than Python. Why is that? How can I improve the runtime of the multiplication relative to NumPy?
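For reference, GSL's manual says its BLAS interface can be swapped at link time for any optimized CBLAS implementation, so one thing I plan to try is calling cblas_sgemm from a library such as OpenBLAS (or Intel MKL) directly. Here is a minimal sketch of that idea; I have not benchmarked it yet, and the exact header name and link flags (e.g. -lopenblas) depend on the installation:

#include <chrono>
#include <iostream>
#include <vector>
#include <cblas.h>  // provided by OpenBLAS/MKL; exact header depends on the installation

using namespace std::chrono;

int main() {
    const int N = 1000;
    // Plain row-major buffers instead of gsl_matrix_float
    std::vector<float> M(N * N), C(N * N);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            M[i * N + j] = 0.23f + 100.0f * i + j;

    auto start = high_resolution_clock::now();
    // C = 1.0 * M * M + 0.0 * C, row-major, no transposition
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0f, M.data(), N, M.data(), N, 0.0f, C.data(), N);
    auto stop = high_resolution_clock::now();

    std::cout << duration_cast<milliseconds>(stop - start).count() << " ms" << std::endl;
    return 0;
}

If my understanding is correct, NumPy's matmul dispatches to exactly this kind of optimized BLAS (OpenBLAS or MKL, depending on the build), while GSL's bundled gslcblas is an unoptimized reference implementation, which would explain most of the gap.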
Background of the problem: I need to evaluate a 1000- to 2000-dimensional integral, which I am doing with the Monte Carlo method. For that I wrote almost the whole integrand as NumPy array operations; this is already quite fast, but I need it even faster because I have to evaluate the same integrand 100,000 to 500,000 times, so any small improvement would help. Does it make sense to rewrite the code in C/C++, or should I stick with NumPy? Thanks!