Generating SIMD instructions from Cython code

Question

I need to get an overview of the performance one can get from using Cython in high performance numerical code. One of the thing I am interested in is to find out if an optimizing C compiler can vectorize code generated by Cython. So I decided to write the following small example:

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef int f(np.ndarray[int, ndim = 1] f):
    cdef int array_length =  f.shape[0]
    cdef int sum = 0
    cdef int k
    for k in range(array_length):
        sum += f[k]
    return sum

I know that there are Numpy functions that does the job, but I would like to have an easy code in order to understand what is possible with Cython. It turns out that the code generated with:

from distutils.core import setup
from Cython.Build import cythonize

setup(ext_modules = cythonize("sum.pyx"))

and called with:

python setup.py build_ext --inplace

generates a C code which look likes this for the loop:

for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_1; __pyx_t_2 += 1) {
  __pyx_v_sum = __pyx_v_sum + (*(int *)((char *) 
    __pyx_pybuffernd_f.rcbuffer->pybuffer.buf +
    __pyx_t_2 * __pyx_pybuffernd_f.diminfo[0].strides)));
}

The main problem with this code is that the compiler does not know at compile time that __pyx_pybuffernd_f.diminfo[0].strides is such that the elements of the array are close together in memory. Without that information, the compiler cannot vectorize efficiently.

Is there a way to do such a thing from Cython?

Your indexing isn't efficient: http://cython.readthedocs.io/en/latest/src/tutorial/numpy.html#efficient-indexing — ead, Mar 01 '18 at 21:45
@ead: I have tried to add `np.ndarray[int, ndim = 1]` but it makes no difference. — InsideLoop, Mar 01 '18 at 21:53
The big issue (I think) is that `sum` is typed as a Python object, not as a numeric type — DavidW, Mar 01 '18 at 22:00
@DavidW: Thanks. I am so stupid. I'll see how it changes the result. Working on it... — InsideLoop, Mar 01 '18 at 22:06
Numba vectorizes this code without problems... I have also tried to get really performant code with cython, but failed. I mostly use Numba but trying also C in combination with Agner Fog Vector Library on really critical non simple things. — max9111, Mar 01 '18 at 22:27
I‘m not sure you saw my last edit, but this probably answers you (current) question — ead, Mar 01 '18 at 23:16

score 11 · Accepted Answer · edited Sep 01 '22 at 17:41

You have two problems in your code (use option -a to make it visible):

The indexing of numpy array isn't efficient
You have forgotten int in cdef sum=0

Taking this into account we get:

cpdef int f(np.ndarray[np.int_t] f):  ##HERE
    assert f.dtype == np.int
    cdef int array_length =  f.shape[0]
    cdef int sum = 0                  ##HERE
    cdef int k
    for k in range(array_length):
        sum += f[k]
    return sum

For the loop the following code:

int __pyx_t_5;
int __pyx_t_6;
Py_ssize_t __pyx_t_7;
....
__pyx_t_5 = __pyx_v_array_length;
for (__pyx_t_6 = 0; __pyx_t_6 < __pyx_t_5; __pyx_t_6+=1) {
   __pyx_v_k = __pyx_t_6;
   __pyx_t_7 = __pyx_v_k;
   __pyx_v_sum = (__pyx_v_sum + (*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_int_t *, __pyx_pybuffernd_f.rcbuffer->pybuffer.buf, __pyx_t_7, __pyx_pybuffernd_f.diminfo[0].strides)));
}

Which is not that bad, but not as easy for the optimizer as the normal code written by human. As you have already pointed out, __pyx_pybuffernd_f.diminfo[0].strides isn't known at compile time and this prevents vectorization.

However, you would get better results, when using typed memory views, i.e:

cpdef int mf(int[::1] f):
    cdef int array_length =  len(f)
...

which leads to a less opaque C-code - the one, at least my compiler, can better optimize:

 __pyx_t_2 = __pyx_v_array_length;
  for (__pyx_t_3 = 0; __pyx_t_3 < __pyx_t_2; __pyx_t_3+=1) {
    __pyx_v_k = __pyx_t_3;
    __pyx_t_4 = __pyx_v_k;
    __pyx_v_sum = (__pyx_v_sum + (*((int *) ( /* dim=0 */ ((char *) (((int *) __pyx_v_f.data) + __pyx_t_4)) ))));
  }

The most crucial thing here, is that we make it clear to the cython, that the memory is continuous, i.e. int[::1] compared to int[:] as it is seen for numpy-arrays, for which a possible stride!=1 must be taken into account.

In this case, the cython-generated C-code results in the same assembler as the code I would have written. As crisb has pointed out, adding -march=native would lead to vectorization, but in this case the assembler of both functions would be slightly different again.

However, in my experience, compilers have quite often some problems to optimize loops created by cython and/or it is easier to miss a detail which prevents the generation of really good C-code. So my strategy for working-horse-loops is to write them in plain C and use cython for wrapping/accessing them - often it is somewhat faster, because one can also use dedicated compiler flags for this code-snipped without affecting the whole Cython-module.

Good answer - FYI if you add `-march=native` to either of your examples on godbolt GCC is smart enough to generate SIMD instructions for this case. — chrisb, Mar 02 '18 at 14:46
found it - extra_compile_args=['-fopenmp', '-march=native'], extra_link_args=['-fopenmp', '-march=native'], https://stackoverflow.com/questions/30631091/optimizing-simple-cpu-bound-loops-using-cython-and-replacing-a-list — Dan Erez, Dec 03 '18 at 14:46

Generating SIMD instructions from Cython code

1 Answers1

Linked