I've come across a large performance difference between a BLAS call (namely daxpy: perform y += alpha * x, where y and x are vectors of the same length and alpha is a scalar) and the same operation performed in pure Cython.
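(For reference, here is the same operation written out in plain NumPy, just to pin down what daxpy computes; the values are made up:)

import numpy as np

alpha = 2.0
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])

# daxpy: y <- alpha * x + y, in double precision
y += alpha * x  # y is now [12., 24., 36.]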
Here's my benchmark, following this SO question to profile Cython code line by line:
%load_ext line_profiler
import line_profiler
%load_ext Cython
from Cython.Compiler.Options import directive_defaults
directive_defaults['linetrace'] = True
directive_defaults['binding'] = True
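(If the directive_defaults import fails on a recent Cython, I believe the getter works instead — a sketch, assuming Cython >= 0.25:)

from Cython.Compiler.Options import get_directive_defaults

directive_defaults = get_directive_defaults()
directive_defaults['linetrace'] = True
directive_defaults['binding'] = True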
The piece of code I want to benchmark:
%%cython -a -f --compile-args=-DCYTHON_TRACE=1
import numpy as np
cimport cython
from scipy.linalg.cython_blas cimport daxpy


@cython.boundscheck(False)
def bench(double[:, ::1] G, double[::1, :] Q, double[:] x):
    cdef int T = G.shape[1]
    cdef int P = G.shape[0]

    cdef int inc = 1  # increment for blas
    cdef int p
    cdef int t
    cdef int i = 42
    # First version:
    for _ in range(50):  # 50 repetitions
        for p in range(P):
            for t in range(T):
                G[p, t] += Q[p, i] * x[t]

        # second version
        for p in range(P):
            daxpy(&T, &Q[p, i], &x[0], &inc, &G[p, 0], &inc)
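(Both versions are meant to compute the same update; a NumPy cross-check of a single repetition, with check_bench being a hypothetical helper that is not part of the benchmark:)

import numpy as np

def check_bench(G, Q, x, i=42):
    # one repetition of either version: G[p, :] += Q[p, i] * x for every row p
    return G + np.outer(Q[:, i], x)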
The results of the benchmark are:
Timer unit: 1e-06 s

Total time: 18.543 s
File: /home/mathurin/.cache/ipython/cython/_cython_magic_32a150cd3ff68e78f896ad9eb33dda69.pyx
Function: bench at line 9

Line #      Hits          Time  Per Hit   % Time  Line Contents
==============================================================
     9                                            def bench(double[:, ::1] G, double[::1, :] Q, double[:] x):
    10         1             1      1.0      0.0      cdef int T = G.shape[1]
    11         1             0      0.0      0.0      cdef int P = G.shape[0]
    12
    13         1             1      1.0      0.0      cdef int inc = 1  # increment for blas
    14                                                cdef int p
    15                                                cdef int t
    16         1             0      0.0      0.0      cdef int i = 42
    17                                                # First version:
    18         1             0      0.0      0.0      for _ in range(50):  # 50 repetitions
    19        50            22      0.4      0.0          for p in range(P):
    20      9000          2512      0.3      0.0              for t in range(T):
    21  63000000  **18002618**      0.3     97.1                  G[p, t] += Q[p, i] * x[t]
    22
    23                                                    # second version
    24        50            15      0.3      0.0          for p in range(P):
    25      9000    **537865**     59.8      2.9              daxpy(&T, &Q[p, i], &x[0], &inc, &G[p, 0], &inc)
Lines 21 and 25 show that the plain loop is more than 30 times slower than the BLAS call (about 18.0 s vs 0.54 s over 50 repetitions). If I'm not mistaken, I'm iterating over the arrays in the correct order (row by row over the C-contiguous G, to maximize cache hits). I'd expect BLAS to be faster, but not by this much. Is there an obvious explanation I'm missing?
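(In case it matters, optimization flags can be passed to the magic, e.g. as below — assuming a gcc-compatible compiler, since -O3 is a gcc/clang flag, and I believe --compile-args can be repeated:)

%%cython -a -f --compile-args=-DCYTHON_TRACE=1 --compile-args=-O3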
The snippet used to obtain the timings is:
np.random.seed(2407)
G = np.random.randn(180, 7000)
Q = np.random.randn(180, 180)
Q += Q.T
Q = np.asfortranarray(Q)
x = np.random.randn(7000)

profile = line_profiler.LineProfiler(bench)
profile.runcall(bench, G, Q, x)
profile.print_stats()
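(To make sure the line tracing itself isn't skewing the numbers, a tracing-free wall-clock cross-check could look like this — a sketch, where bench_loop and bench_blas are hypothetical variants of bench containing only one version each, compiled without -DCYTHON_TRACE=1:)

import time

for fn in (bench_loop, bench_blas):  # hypothetical single-version variants of bench
    G_ = G.copy()  # fresh copy so both variants see the same input
    t0 = time.perf_counter()
    fn(G_, Q, x)
    print(fn.__name__, time.perf_counter() - t0, "seconds")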