
After compiling the following Cython code, I get an annotation HTML file that looks like this:

import numpy as np
cimport numpy as np

cpdef my_function(np.ndarray[np.double_t, ndim = 1] array_a,
                  np.ndarray[np.double_t, ndim = 1] array_b,
                  int n_rows,
                  int n_columns):

    # array_a is treated as an n_rows x n_columns matrix stored in
    # flattened row-major order; the four slices zero its border
    array_a[0:-1:n_columns] = 0                               # first column
    array_a[n_columns - 1:n_rows * n_columns:n_columns] = 0   # last column
    array_a[0:n_columns] = 0                                  # first row
    array_a[n_columns * (n_rows - 1):n_rows * n_columns] = 0  # last row
    array_b[array_a == 3] = 0                                 # boolean-mask assignment

    return array_a, array_b

[screenshot of the Cython annotation HTML: the lines inside the function body are highlighted yellow]

My question is: why are those operations in my function still yellow? Does this mean that the code is still not as fast as it could be with Cython?

Behzad Jamali
  • Did you expand any of those lines with the `+`? Useful documentation: http://cython.readthedocs.io/en/latest/src/quickstart/cythonize.html#determining-where-to-add-types – hpaulj Jan 16 '18 at 03:22
  • Yes I did. I understand that they are translated C code but can't figure out why they are still yellow after all the types are declared. – Behzad Jamali Jan 16 '18 at 03:30
  • It's not just a matter of declaring. Even if it knows that `array_a` is a `ndarray`, it still has to use `numpy` calls (at the C level) to perform that indexing. Look at the `typed memoryview` page to get some ideas on how you can operate on the arrays as 'native' cython objects. – hpaulj Jan 16 '18 at 03:36
  • Ok, I now see what you mean. I'm reading about `memoryview`, thanks for the comment. – Behzad Jamali Jan 16 '18 at 03:49
  • @hpaulj How about creating a memory view of `array_a` inside the function and doing the operations on it: `cdef double [:] v_a = array_a`, then e.g. `v_a[0:-1:n_columns] = 0` (a sketch of this is shown below)? – Behzad Jamali Jan 16 '18 at 04:49
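For illustration, here is a sketch of what that last comment suggests: the whole function with the strided assignments done through a typed memoryview of `array_a` (the name `my_function_mv` is purely illustrative; whether this actually helps is what the answer below investigates):

%%cython
cimport numpy as np

cpdef my_function_mv(np.ndarray[np.double_t, ndim=1] array_a,
                     np.ndarray[np.double_t, ndim=1] array_b,
                     int n_rows,
                     int n_columns):
    # take a typed memoryview of array_a and run the strided
    # zero-assignments through it instead of through the ndarray
    cdef double[:] v_a = array_a
    v_a[0:-1:n_columns] = 0
    v_a[n_columns - 1:n_rows * n_columns:n_columns] = 0
    v_a[0:n_columns] = 0
    v_a[n_columns * (n_rows - 1):n_rows * n_columns] = 0
    # the boolean-mask assignment has no direct memoryview equivalent,
    # so it stays a numpy operation
    array_b[array_a == 3] = 0
    return array_a, array_b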

1 Answer


As you already know, yellow lines mean that some interaction with Python happens, i.e. Python functionality rather than raw C functionality is used. You can look into the produced C code to see what happens and whether it can or should be fixed/avoided.

Not every interaction with Python means a (measurable) slowdown.
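As a side note, in case it is unclear where the yellow view comes from: the annotation can be generated from the command line or directly in a notebook (the flags below are the standard ones, the module name is just a placeholder):

cython -a my_module.pyx    # writes an annotated my_module.html
%%cython --annotate        # shows the same view inline for a notebook cell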

Let's take a look at this simplified function:

%%cython
cimport numpy as np
def use_slices(np.ndarray[np.double_t] a):
    a[0:len(a)]=0.0

When we look into the produced code we see (I kept only the important parts):

  __pyx_t_1 = PyObject_Length(((PyObject *)__pyx_v_a)); 
  __pyx_t_2 = PyInt_FromSsize_t(__pyx_t_1); 
  __pyx_t_3 = PySlice_New(__pyx_int_0, __pyx_t_2, Py_None); 
  PyObject_SetItem(((PyObject *)__pyx_v_a)

So basically we get a new slice (which is a numpy array) and then use numpy's functionality (`PyObject_SetItem`) to set all elements to 0.0; that work is C code under the hood.
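Expressed at the Python level, those three generated calls amount to something like this (illustration only):

n = len(a)        # PyObject_Length / PyInt_FromSsize_t
s = slice(0, n)   # PySlice_New
a[s] = 0.0        # PyObject_SetItem -> dispatches to numpy's C routine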

Let's take a look at a version with a hand-written for loop:

%%cython
cimport numpy as np
def use_for(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i]=0.0

It still uses PyObject_Length (because of the len call) and bounds checking, but otherwise it is C code. When we compare times:

>>> import numpy as np
>>> a=np.ones((500,))
>>> %timeit use_slices(a)
100000 loops, best of 3: 1.85 µs per loop
>>> %timeit use_for(a)
1000000 loops, best of 3: 1.42 µs per loop

>>> b=np.ones((250000,))
>>> %timeit use_slices(b)
10000 loops, best of 3: 104 µs per loop
>>> %timeit use_for(b)
1000 loops, best of 3: 364 µs per loop

For small arrays you can see the additional overhead of creating the slice, but the additional checks in the for-version mean it has more overhead in the long run.
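As a side note, the PyObject_Length call coming from `len(a)` can be avoided by reading the buffer's shape directly; a hypothetical variant (`use_for_shape` is just an illustrative name):

%%cython
cimport numpy as np
def use_for_shape(np.ndarray[np.double_t] a):
    cdef int i
    # a.shape[0] is read straight from the array's C struct, so no
    # Python call is needed to get the length
    for i in range(a.shape[0]):
        a[i] = 0.0

The bounds and wraparound checks mentioned above remain, though; those are what the next version disables.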

Let's disable these checks:

%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i]=0.0

In the produced HTML we can see that `a[i]` becomes as simple as it gets:

 __pyx_t_3 = __pyx_v_i;
    *__Pyx_BufPtrStrided1d(__pyx_t_5numpy_double_t *, __pyx_pybuffernd_a.rcbuffer->pybuffer.buf, __pyx_t_3, __pyx_pybuffernd_a.diminfo[0].strides) = 0.0;
  }

`__Pyx_BufPtrStrided1d(type, buf, i0, s0)` is defined as `(type)((char*)buf + i0 * s0)`. And now:

>>> %timeit use_for_no_checks(a)
1000000 loops, best of 3: 1.17 µs per loop
>>> %timeit use_for_no_checks(b)
1000 loops, best of 3: 246 µs per loop

We can improve it further by releasing the GIL around the for loop:

%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks_no_gil(np.ndarray[np.double_t] a):
    cdef int i
    cdef int n=len(a)
    with nogil:
      for i in range(n):
        a[i]=0.0

and now:

>>> %timeit use_for_no_checks_no_gil(a)
1000000 loops, best of 3: 1.07 µs per loop
>>> %timeit use_for_no_checks_no_gil(b)
10000 loops, best of 3: 166 µs per loop

So it is somewhat faster, but for larger arrays you still cannot beat numpy.
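For reference, the plain-numpy baseline implied by that comparison is simply slice assignment on the array itself, with no Cython involved (illustration only):

import numpy as np

def use_numpy(a):
    # numpy's own slice assignment; its inner loop is already compiled C
    a[0:len(a)] = 0.0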

In my opinion, there are two things to take from it:

  1. Cython will not transform slice assignments into a loop over the elements, so Python functionality (i.e. numpy) must be used.
  2. There is a small overhead for calling into the numpy functionality, but most of the work is done inside numpy's C code anyway, and that part cannot be sped up by Cython.

One last try, using the `memset` function:

%%cython
from libc.string cimport memset
cimport numpy as np
def use_memset(np.ndarray[np.double_t] a):
    memset(&a[0], 0, len(a)*sizeof(np.double_t))

We get:

>>> %timeit use_memset(a)
1000000 loops, best of 3: 821 ns per loop
>>> %timeit use_memset(b)
10000 loops, best of 3: 102 µs per loop

For large arrays it is also about as fast as the numpy code.
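One caveat, since `use_memset` writes raw bytes: it silently assumes the data is contiguous (and that a double whose bytes are all zero is 0.0, which holds for IEEE-754 doubles). A hypothetical guarded variant (`use_memset_checked` is just an illustrative name) could let Cython enforce contiguity via the buffer declaration:

%%cython
from libc.string cimport memset
cimport numpy as np
def use_memset_checked(np.ndarray[np.double_t, ndim=1, mode='c'] a):
    # mode='c' makes Cython reject non-contiguous input when the buffer
    # is acquired, so writing len(a)*sizeof(double) raw bytes is safe
    if len(a) > 0:
        memset(&a[0], 0, len(a) * sizeof(np.double_t))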


As DavidW suggested, one could try to use memory-views:

%%cython
cimport numpy as np
def use_slices_memview(double[::1] a):
    a[0:len(a)]=0.0

This leads to slightly faster code for small arrays and similarly fast code for large arrays (compared with the numpy slices):

>>> %timeit use_slices_memview(a)
1000000 loops, best of 3: 1.52 µs per loop

>>> %timeit use_slices_memview(b)
10000 loops, best of 3: 105 µs per loop

That means the memory-view slices have less overhead than the numpy slices. Here is the produced code:

 __pyx_t_1 = __Pyx_MemoryView_Len(__pyx_v_a); 
  __pyx_t_2.data = __pyx_v_a.data;
  __pyx_t_2.memview = __pyx_v_a.memview;
  __PYX_INC_MEMVIEW(&__pyx_t_2, 0);
  __pyx_t_3 = -1;
  if (unlikely(__pyx_memoryview_slice_memviewslice(
    &__pyx_t_2,
    __pyx_v_a.shape[0], __pyx_v_a.strides[0], __pyx_v_a.suboffsets[0],
    0,
    0,
    &__pyx_t_3,
    0,
    __pyx_t_1,
    0,
    1,
    1,
    0,
    1) < 0))
{
    __PYX_ERR(0, 27, __pyx_L1_error)
}

{
      double __pyx_temp_scalar = 0.0;
      {
          Py_ssize_t __pyx_temp_extent = __pyx_t_2.shape[0];
          Py_ssize_t __pyx_temp_idx;
          double *__pyx_temp_pointer = (double *) __pyx_t_2.data;
          for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
            *((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
            __pyx_temp_pointer += 1;
          }
      }
  }
  __PYX_XDEC_MEMVIEW(&__pyx_t_2, 1);
  __pyx_t_2.memview = NULL;
  __pyx_t_2.data = NULL;

I think this is the most important part: this code doesn't create an additional temporary Python object; it reuses the existing memory view for the slice.

My compiler produces (at least on my machine) slightly faster code if memory views are used. I'm not sure whether it is worth investigating. At first sight, the difference in every iteration step is:

# created code for memview-slices:
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
 __pyx_temp_pointer += 1;

#created code for memview-for-loop:
 __pyx_v_i = __pyx_t_3;
 __pyx_t_4 = __pyx_v_i;
 *((double *) ( /* dim=0 */ ((char *) (((double *) data) + __pyx_t_4)) )) = 0.0;

I would expect different compilers to handle this code differently well. But clearly, the first version is easier to optimize.


As Behzad Jamali pointed out, there is a difference between `double[:] a` and `double[::1] a`. The second version, using slices, is about 20% faster on my machine. The difference is that for the `double[::1]` version it is known at compile time that the memory accesses will be consecutive, and this can be used for optimization. With `double[:]` we don't know anything about the stride until runtime.
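To make the difference concrete, here are the two signatures side by side (illustrative names only). Passing a non-contiguous view such as `np.ones(100)[::2]` to the `double[::1]` version is rejected with an error when the memoryview is created, while the `double[:]` version accepts it:

%%cython
def set_zero_any_stride(double[:] a):
    # stride is only known at runtime, so the generated loop must use it
    a[0:len(a)] = 0.0

def set_zero_contiguous(double[::1] a):
    # declared C-contiguous: the compiler may assume unit stride
    a[0:len(a)] = 0.0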

ead
  • As far as I understood if we don't have yellow lines we can use `nogil`. Is this also an important consideration? – Behzad Jamali Jan 16 '18 at 12:35
  • @BehzadJamali Thanks, I had forgotten to release the GIL, so this speeds up the for-version somewhat (see my edit). However, it still cannot beat numpy. If I'm not mistaken, numpy releases the GIL for this kind of operation. – ead Jan 16 '18 at 12:58
  • It might be worth quickly looking at memoryview slices too. I doubt they'd be any different in speed, but it's possible that they might be naturally gil-free. (That said, I doubt releasing the GIL is as important as the OP thinks.) – DavidW Jan 16 '18 at 14:48
  • @DavidW I took a look at memory view slices and added some details to the answer. Surprisingly, they have less overhead than the numpy-slices and produce a slightly different c-code which can be optimized better at least with my compiler. – ead Jan 16 '18 at 16:03
  • Why did you use `double[::1] a` in the memoryview part? Why not `double[:] a`? – Behzad Jamali Jan 16 '18 at 21:15
  • @BehzadJamali Basically, I give the compiler the information that the data is contiguous in memory. That helps the compiler optimize the code, thus `double[::1]` can sometimes be faster (but never slower). Actually, for this example `double[::1]` is somewhat faster. – ead Jan 16 '18 at 21:43
  • Can I also benefit from memory views for `array_b[array_a == 3] = 0`? – Behzad Jamali Jan 16 '18 at 22:03
  • @BehzadJamali I don't know; you would have to measure it to be certain. But I would not dwell too long on it, because the differences are not very big, so use whatever is easier to maintain (see the loop sketch below). – ead Jan 16 '18 at 22:17
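For what it's worth, the explicit-loop equivalent of `array_b[array_a == 3] = 0` with memoryviews would look something like the sketch below (`mask_assign` is a hypothetical name, called as `mask_assign(array_a, array_b)`; whether it beats numpy's masked assignment would have to be measured, as ead says):

%%cython
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def mask_assign(double[:] a, double[:] b):
    # explicit-loop replacement for b[a == 3] = 0
    cdef Py_ssize_t i
    for i in range(a.shape[0]):
        if a[i] == 3.0:
            b[i] = 0.0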