
I am currently trying to implement basic matrix-vector multiplication in Cython (as part of a much larger project to reduce computation) and am finding that my code is about 2x slower than numpy.dot.

I am wondering if there is something that I am missing that is causing the slowdown. I am writing optimized Cython code: declaring variable types, requiring contiguous arrays, and avoiding cache misses. I even tried using Cython as a wrapper and calling native C code (see below).

I'm wondering: what else could I do to speed up my implementation so that it runs as quickly as NumPy for this basic operation?


The Cython code that I'm using is below:

import numpy as np
cimport numpy as np
cimport cython

DTYPE = np.float64
ctypedef np.float64_t DTYPE_T

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def matrix_vector_multiplication(np.ndarray[DTYPE_T, ndim=2] A, np.ndarray[DTYPE_T, ndim=1] x):

    cdef Py_ssize_t i, j
    cdef Py_ssize_t N = A.shape[0]
    cdef Py_ssize_t D = A.shape[1]
    cdef np.ndarray[DTYPE_T, ndim=1] y = np.empty(N, dtype = DTYPE)
    cdef DTYPE_T val

    for i in range(N):
        val = 0.0
        for j in range(D):
            val += A[i,j] * x[j]
        y[i] = val
    return y

I am compiling this file (seMatrixVectorExample.pyx) using the following script:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy as np

ext_modules=[ Extension("seMatrixVectorExample",
                        ["seMatrixVectorExample.pyx"],
                        libraries=["m"],
                        extra_compile_args = ["-ffast-math"])]

setup(
    name = "seMatrixVectorExample",
    cmdclass = {"build_ext": build_ext},
    include_dirs = [np.get_include()],
    ext_modules = ext_modules
)
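
For reference, the extension is built in place with the standard distutils invocation (assuming the script above is saved as setup.py):

python setup.py build_ext --inplace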

and using the following test script to assess performance:

import numpy as np
from seMatrixVectorExample import matrix_vector_multiplication
import time

n_rows, n_cols = int(1e6), 100
np.random.seed(seed = 0)

#initialize data matrix X and label vector Y
A = np.random.random(size=(n_rows, n_cols))
A = np.require(A, requirements = ['C'])

x = np.random.random(size=n_cols)
x = np.require(x, requirements = ['C'])

start_time = time.time()
scores = matrix_vector_multiplication(A, x)
print "cython runtime = %1.5f seconds" % (time.time() - start_time)

start_time = time.time()
py_scores = np.exp(A.dot(x))
print "numpy runtime = %1.5f seconds" % (time.time() - start_time)

For a test matrix with n_rows = 1e6 and n_cols = 100 I get:

cython runtime = 0.08852 seconds
numpy runtime = 0.04372 seconds

Edit: It's worth mentioning that the slowdown persists even when I implement the matrix-vector multiplication in native C code and only use Cython as a wrapper.

void c_matrix_vector_multiplication(double* y, double* A, double* x, int N, int D) {

    int i, j;
    int index = 0;
    double val;

    for (i = 0; i < N; i++) {
        val = 0.0;
        for (j = 0; j < D; j++) {
            val += A[index] * x[j];
            index++;
        }
        y[i] = val;
    }
    return;
}

and here is the Cython wrapper, which just passes pointers to the first elements of y, A, and x:

import cython
import numpy as np
cimport numpy as np

DTYPE = np.float64
ctypedef np.float64_t DTYPE_T

# declare the interface to the C code
cdef extern void c_matrix_vector_multiplication(double* y, double* A, double* x, int N, int D)

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def multiply(np.ndarray[DTYPE_T, ndim=2, mode="c"] A, np.ndarray[DTYPE_T, ndim=1, mode="c"] x):

    cdef int N = A.shape[0]
    cdef int D = A.shape[1]
    cdef np.ndarray[DTYPE_T, ndim=1, mode = "c"] y = np.empty(N, dtype = DTYPE)

    c_matrix_vector_multiplication(&y[0], &A[0,0], &x[0], N, D)

    return y
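
Note that the C file has to be compiled and linked together with the wrapper. A minimal sketch of a setup script that does this (the file names below are placeholders, not the actual project files):

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext
import numpy as np

# file names are placeholders -- adjust to match the actual .pyx and .c files
ext_modules = [Extension("seCWrapperExample",
                         sources=["seCWrapperExample.pyx", "c_matrix_vector_multiplication.c"],
                         include_dirs=[np.get_include()],
                         libraries=["m"],
                         extra_compile_args=["-ffast-math"])]

setup(name="seCWrapperExample",
      cmdclass={"build_ext": build_ext},
      ext_modules=ext_modules)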
  • [This](http://stackoverflow.com/questions/10442365/why-is-matrix-multiplication-faster-with-numpy-than-with-ctypes-in-python) question/answer seems related, with various reasons given in the top answer. Check it out. – russianfool Feb 09 '16 at 22:55
  • @russianfool Thanks! I had actually read through the answers to that question, but the reasons given aren't all that relevant to this problem because I'm dealing with matrix-vector multiplication instead of matrix-matrix multiplication. I'll clarify this in my question. – Berk U. Feb 09 '16 at 23:03
  • Seems somewhat related to me; namely, read the bits about BLAS/unrolled loops. You can find a matrix-vector multiplication implementation [here](http://www.netlib.org/clapack/cblas/cgemv.c), and it definitely looks like they have various optimized versions based on the data that you're passing in. On a side note, I'm not too familiar with distutils... could you pass in -O2 as one of the extra_compile_args? If it's not compiling with -O2 under the hood, it doesn't make sense to compare performances. – russianfool Feb 09 '16 at 23:28 (see the sketch after this comment thread)
  • Getting within a factor of 2 to the optimized `dot` sounds pretty good. You don't have much say in how `cython` translates your code to `c`. The distinction between matrix-vector multiplication and matrix-matrix is not significant. A vector is just a matrix with a size 1 dimension. The BLAS code is not doing any unnecessary calculations. – hpaulj Feb 10 '16 at 01:06
  • Compare with the [numpy source code](https://github.com/numpy/numpy/blob/2f7827702ef6b6ac4b318103d5c0dfe2ff6e7eb3/numpy/core/src/multiarray/cblasfuncs.c#L242). – msw Feb 10 '16 at 15:37
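
As russianfool suggests above, it is worth confirming that the extension is actually compiled with optimizations enabled; a minimal sketch of passing explicit flags through extra_compile_args (the particular flags are only a suggestion, not from the original post):

ext_modules = [Extension("seMatrixVectorExample",
                         ["seMatrixVectorExample.pyx"],
                         libraries=["m"],
                         extra_compile_args=["-O2", "-march=native", "-ffast-math"])]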

1 Answer


OK, I finally managed to get runtimes that are better than NumPy!

Here is what (I think) caused the difference: NumPy is calling BLAS routines, which are coded in Fortran rather than C, and that accounts for the speed difference.

I think this is important to note, since I was previously under the impression that the BLAS functions were coded in C, and I could not see why they would run noticeably faster than the second native C implementation that I posted in the question.
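
You can check which BLAS implementation your own NumPy build is linked against directly from NumPy:

import numpy as np
np.__config__.show()   # prints the BLAS/LAPACK libraries NumPy was built with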

In any case, I can now match NumPy's performance by calling BLAS directly from Cython via the function pointers exposed in scipy.linalg.cython_blas.
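
As a quick sanity check, the same routine can also be called from pure Python through scipy.linalg.blas before writing any Cython (a minimal sketch; this is not part of the timed code below):

import numpy as np
from scipy.linalg.blas import dgemv

A = np.asfortranarray(np.random.random((1000, 100)))  # Fortran order, matching the Cython code below
x = np.random.random(100)
y = dgemv(1.0, A, x)                   # y = 1.0 * A.dot(x)
assert np.allclose(y, A.dot(x))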


For completeness, here is the new Cython code blas_multiply.pyx:

import cython
import numpy as np
cimport numpy as np
cimport scipy.linalg.cython_blas as blas

DTYPE = np.float64
ctypedef np.float64_t DTYPE_T

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def blas_multiply(np.ndarray[DTYPE_T, ndim=2, mode="fortran"] A, np.ndarray[DTYPE_T, ndim=1, mode="fortran"] x):
    # calls dgemv from BLAS, which computes y = alpha * A.dot(x) + beta * y
    # (or alpha * A.T.dot(x) + beta * y when trans is "T")
    # see: http://www.nag.com/numeric/fl/nagdoc_fl22/xhtml/F06/f06paf.xml

    cdef int N = A.shape[0]
    cdef int D = A.shape[1]
    cdef int lda = N  # leading dimension of A (A is Fortran-contiguous, so lda = N)
    cdef int incx = 1  # stride of x
    cdef int incy = 1  # stride of y
    cdef double alpha = 1.0
    cdef double beta = 0.0
    cdef np.ndarray[DTYPE_T, ndim=1, mode = "fortran"] y = np.empty(N, dtype = DTYPE)

    blas.dgemv("N", &N, &D, &alpha, &A[0,0], &lda, &x[0], &incx, &beta, &y[0], &incy)

    return y

Here is the code that I use to build:

#!/usr/bin/env python

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

import numpy
import scipy

ext_modules=[ Extension("blas_multiply",
                        sources=["blas_multiply.pyx"],
                        include_dirs=[numpy.get_include(), scipy.get_include()],
                        libraries=["m"],
                        extra_compile_args = ["-ffast-math"])]

setup(
    cmdclass = {'build_ext': build_ext},
    include_dirs = [numpy.get_include(), scipy.get_include()],
    ext_modules = ext_modules,
)

And here is the testing code (note that the arrays passed to the BLAS function are F_CONTIGUOUS now):

import numpy as np
from blas_multiply import blas_multiply
import time

#np.__config__.show()
n_rows, n_cols = int(1e6), 100
np.random.seed(seed = 0)

#initialize data matrix X and label vector Y
X = np.random.random(size=(n_rows, n_cols))
Y = np.random.randint(low=0, high=2, size=(n_rows, 1))
Y[Y==0] = -1
Z = X*Y
Z = np.require(Z, requirements = ['F'])

rho_test = np.random.randint(low=-10, high=10, size=n_cols)
set_to_zero = np.random.choice(range(0, n_cols), size=n_cols // 2, replace=False)
rho_test[set_to_zero] = 0.0
rho_test = np.require(rho_test, dtype=Z.dtype, requirements = ['F'])

start_time = time.time()
scores = blas_multiply(Z, rho_test)
print "Cython runtime = %1.5f seconds" % (time.time() - start_time)


Z = np.require(Z, requirements = ['C'])
rho_test = np.require(rho_test, requirements = ['C'])
start_time = time.time()
py_scores = np.exp(Z.dot(rho_test))
print "Python runtime = %1.5f seconds" % (time.time() - start_time)
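
As a quick correctness check (not part of the original timing script), the BLAS result can be compared against np.dot on the same data:

assert np.allclose(scores, Z.dot(rho_test))   # blas_multiply should agree with np.dot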

The result from this test on my machine is:

Cython runtime = 0.04556 seconds
Python runtime = 0.05110 seconds
  • I have to ask... was it really worth all that effort for a 10% performance bump? The amount of numpy/Python overhead will be roughly constant with respect to the array dimensions, so I would expect to see rapidly diminishing returns as you apply this to larger and larger datasets. Calling BLAS from Cython *might* make sense if you're computing matrix products for a very large number of small matrices (but even in that case you could do pretty well using `np.dot` or `np.matmul`'s built-in vectorization...). For one big matrix product it will probably make almost no difference. – ali_m Feb 10 '16 at 20:04
  • @ali_m Haha it was definitely **not** worth it for just matrix-vector multiplication. That said, it was important for me to get this running right and to understand what was causing the slowdown, because this is a subroutine of a much larger routine that I intend to optimize using Cython (I was also just frustrated at people pointing to BLAS like some magic black box without explaining what it was really doing). When I first implemented it, it was slow enough that I thought I was doing something very wrong in Cython. – Berk U. Feb 10 '16 at 20:52
  • There's nothing magic about BLAS, but it does represent the efforts of a bunch of skilled Fortran programmers who are prepared to hand-craft optimized versions of these routines to squeeze out every last drop of performance from some specific model of processor. I'm honestly a bit surprised that you can even get within a factor of two with naive matrix-vector multiplication. A call to an optimized BLAS routine is probably about the best you can do for dense matrix-vector multiplication, aside from maybe doing it on the GPU... – ali_m Feb 10 '16 at 21:24