I've come across a large performance difference between a BLAS call (namely daxpy: perform y += alpha * x, where y and x are vectors of the same length and alpha is a scalar) and the same operation performed in pure Cython.
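(For reference, here is the same operation written out in plain NumPy, just to pin down what daxpy computes; the values are made up:)

import numpy as np

alpha = 2.0
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

# daxpy: y <- alpha * x + y, in double precision
y += alpha * x  # y is now [12., 24., 36.]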
Here's my benchmark, following this SO question to profile Cython code line by line:
%load_ext line_profiler
import line_profiler
%load_ext Cython
from Cython.Compiler.Options import directive_defaults
directive_defaults['linetrace'] = True
directive_defaults['binding'] = True
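(If the directive_defaults import fails on a recent Cython, I believe the getter works instead — a sketch, assuming Cython >= 0.25:)

from Cython.Compiler.Options import get_directive_defaults

directive_defaults = get_directive_defaults()
directive_defaults['linetrace'] = True
directive_defaults['binding'] = True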
The piece of code I want to benchmark:
%%cython -a -f --compile-args=-DCYTHON_TRACE=1
import numpy as np
cimport cython
from scipy.linalg.cython_blas cimport daxpy


@cython.boundscheck(False)
def bench(double[:, ::1] G, double[::1, :] Q, double[:] x):
    cdef int T = G.shape[1]
    cdef int P = G.shape[0]

    cdef int inc = 1  # increment for blas
    cdef int p
    cdef int t
    cdef int i = 42
    # First version:
    for _ in range(50):  # 50 repetitions
        for p in range(P):
            for t in range(T):
                G[p, t] += Q[p, i] * x[t]

        # second version
        for p in range(P):
            daxpy(&T, &Q[p, i], &x[0], &inc, &G[p, 0], &inc)
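(Both versions are meant to compute the same update; a NumPy cross-check of a single repetition, with check_bench being a hypothetical helper that is not part of the benchmark:)

import numpy as np

def check_bench(G, Q, x, i=42):
    # one repetition of either version: G[p, :] += Q[p, i] * x for every row p
    return G + np.outer(Q[:, i], x)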
The results of the benchmark are:
Timer unit: 1e-06 s

Total time: 18.543 s
File: /home/mathurin/.cache/ipython/cython/_cython_magic_32a150cd3ff68e78f896ad9eb33dda69.pyx
Function: bench at line 9

Line #      Hits          Time  Per Hit   % Time  Line Contents
==============================================================
     9                                            def bench(double[:, ::1] G, double[::1, :] Q, double[:] x):
    10         1             1      1.0      0.0      cdef int T = G.shape[1]
    11         1             0      0.0      0.0      cdef int P = G.shape[0]
    12
    13         1             1      1.0      0.0      cdef int inc = 1  # increment for blas
    14                                                cdef int p
    15                                                cdef int t
    16         1             0      0.0      0.0      cdef int i = 42
    17                                                # First version:
    18         1             0      0.0      0.0      for _ in range(50):  # 50 repetitions
    19        50            22      0.4      0.0          for p in range(P):
    20      9000          2512      0.3      0.0              for t in range(T):
    21  63000000  **18002618**      0.3     97.1                  G[p, t] += Q[p, i] * x[t]
    22
    23                                                    # second version
    24        50            15      0.3      0.0          for p in range(P):
    25      9000    **537865**     59.8      2.9              daxpy(&T, &Q[p, i], &x[0], &inc, &G[p, 0], &inc)
Lines 21 and 25 show that the plain loop is more than 30 times slower than the BLAS call (about 18.0 s vs 0.54 s over 50 repetitions). If I'm not mistaken, I'm iterating over the arrays in the correct order (row by row over the C-contiguous G, to maximize cache hits). I'd expect BLAS to be faster, but not by this much. Is there an obvious explanation I'm missing?
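(In case it matters, optimization flags can be passed to the magic, e.g. as below — assuming a gcc-compatible compiler, since -O3 is a gcc/clang flag, and I believe --compile-args can be repeated:)

%%cython -a -f --compile-args=-DCYTHON_TRACE=1 --compile-args=-O3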
The snippet used to obtain the timings is:
np.random.seed(2407)
G = np.random.randn(180, 7000)
Q = np.random.randn(180, 180)
Q += Q.T
Q = np.asfortranarray(Q)
x = np.random.randn(7000)

profile = line_profiler.LineProfiler(bench)
profile.runcall(bench, G, Q, x)
profile.print_stats()
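(To make sure the line tracing itself isn't skewing the numbers, a tracing-free wall-clock cross-check could look like this — a sketch, where bench_loop and bench_blas are hypothetical variants of bench containing only one version each, compiled without -DCYTHON_TRACE=1:)

import time

for fn in (bench_loop, bench_blas):  # hypothetical single-version variants of bench
    G_ = G.copy()  # fresh copy so both variants see the same input
    t0 = time.perf_counter()
    fn(G_, Q, x)
    print(fn.__name__, time.perf_counter() - t0, "seconds")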