
I have two versions of a function that appends a row to a 2D array: one in Cython and one in Numba.

The Cython version is a lot slower than the Numba version. I would like to optimise the Cython version so that it performs at least as well as the Numba version.

I am timing the code with this timer.py module:

import time

class Timer(object):
    def __init__(self, name='', output=print):
        self._name = name
        self._output = output

    def __enter__(self):
        self.start = time.time()
        return self

    def __exit__(self, a, b, c):
        self.end = time.time()
        self.time_taken = self.end - self.start
        self._output('%s Took %0.2f seconds' % (self._name, self.time_taken))

My append_2d_cython.pyx module is:

#!python
#cython: boundscheck=False
#cython: wraparound=False


import numpy as np
cimport numpy as cnp

cnp.import_array()  # needed to initialize numpy-API


cpdef empty_2d(int d1, int d2):
    cdef:
        cnp.npy_intp[2] dimdim

    dimdim[0] = d1
    dimdim[1] = d2

    return cnp.PyArray_SimpleNew(2, dimdim, cnp.NPY_INT32)


cpdef append_2d(int[:, :] arr, int[:] value):

    cdef int[:, :] result

    result = empty_2d(arr.shape[0]+1, arr.shape[1])

    result[:arr.shape[0], :] = arr

    result[arr.shape[0], :] = value

    return result

My append_2d_numba.py module is:

import numba as nb
import numpy as np


@nb.jit(nopython=True)
def append_2d(arr, value):

    result = np.empty((arr.shape[0]+1, arr.shape[1]), dtype=arr.dtype)
    result[:-1] = arr
    result[-1] = value
    return result

I am comparing the Numba and Cython versions of append_2d with this script:

import pyximport
import numpy as np
pyximport.install(setup_args={'include_dirs': np.get_include()})

from timer import Timer
from append_2d_cython import append_2d as append_2d_cython
from append_2d_numba import append_2d as append_2d_numba

arr_2d = np.random.randint(0, 100, size=(5, 4), dtype=np.int32)
arr_1d = np.array([0, 1, 2, 3], np.int32)

num_tests = 100000

with Timer('append_2d_cython'):
    for _ in range(num_tests):
        r_cython = append_2d_cython(arr_2d, arr_1d)

# JIT compile it
append_2d_numba(arr_2d, arr_1d)

with Timer('append_2d_numba'):
    for _ in range(num_tests):
        r_numba = append_2d_numba(arr_2d, arr_1d)

Which prints:

append_2d_cython Took 0.36 seconds
append_2d_numba Took 0.12 seconds

So, for this code, Numba is three times faster than Cython. I would like to refactor the Cython code to be at least as fast as the Numba code. How can I do that?

Ginger
  • If possible, I would use the built-in `timeit` module to time your code. – juanpa.arrivillaga Nov 04 '18 at 21:46
  • Here's some discussion about profiling Cython code: https://stackoverflow.com/questions/28301931/how-to-profile-cython-functions-line-by-line. Have you compiled it with the `annotate` option enabled to see if the GIL is slowing it down anywhere? – SuperShoot Nov 04 '18 at 21:47
  • Why do you think it is possible for Cython to be faster than Numba for this? Maybe Numba is simply faster... – zvone Nov 04 '18 at 22:00
  • @zvone It might be. But if we don't try to improve we'll never get any better. ;P – Ginger Nov 04 '18 at 22:07
  • I think there's actually a reasonable amount of work in getting a memoryview from a numpy array. They aren't great for small functions that you'll call often from Python. In my experience Numba wins convincingly for this type of function and there's not much you can do. – DavidW Nov 04 '18 at 22:10
  • 1) You can get much better performance if you use overallocation (take a look at how std::vector in C++ is implemented, e.g. http://www.drdobbs.com/c-made-easier-how-vectors-grow/184401375). An implementation like this should also be straightforward for nd-arrays as long as you are only appending along the slowest-changing dimension. 2) I don't think it is really relevant here, but your Numba solution (automatically) takes into account that you use contiguous arrays, while your Cython solution does not: `int[::1]` is contiguous, `int[:]` can be strided (see the sketch after this list). https://stackoverflow.com/a/50383529/4045774 – max9111 Nov 05 '18 at 08:53
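Below is a minimal sketch of what max9111's contiguity point could look like applied to the question's function. The name append_2d_contig is hypothetical, and the sketch assumes the inputs are C-contiguous int32 arrays (which they are for arrays created as in the question's script); it is an illustration, not the asker's code:

# hypothetical variant of append_2d using C-contiguous memoryviews
#cython: boundscheck=False
#cython: wraparound=False
import numpy as np

cpdef append_2d_contig(int[:, ::1] arr, int[::1] value):
    # int[:, ::1] / int[::1] declare C-contiguous memoryviews, so Cython
    # does not have to handle arbitrary strides when copying
    cdef int[:, ::1] result = np.empty((arr.shape[0] + 1, arr.shape[1]),
                                       dtype=np.int32)
    result[:arr.shape[0], :] = arr
    result[arr.shape[0], :] = value
    return result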

1 Answer


This investigation will show that Cython's large calling overhead is the reason for its poor performance. Furthermore, a (somewhat hacky) alternative will be presented that avoids most of it - so the Numba solution will be beaten by a factor of 4.

Let's start by establishing the baseline on my machine (I call your functions cy_append_2d and nb_append_2d and use the %timeit magic to measure the running times):

arr_2d = np.arange(5*4, dtype=np.int32).reshape((5,4))
arr_1d = np.array([0, 1, 2, 3], np.int32)

%timeit cy_append_2d(arr_2d, arr_1d)
# 8.27 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit nb_append_2d(arr_2d, arr_1d)
# 2.84 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The Numba version is about three times faster - similar to the timing you are observing.

However, we must be aware that what we are measuring isn't the time needed to copy the data, but the overhead. It is not as if Numba is doing something fancy - it just happens to have less overhead (but still quite a lot - almost 3µs for creating a numpy array and copying 24 integers!).

If we increase the amount of copied data, we will see that Cython and Numba have pretty similar performance - no fancy compiler can improve copying by much:

N=5000
arr_2d_large = np.arange(5*N, dtype=np.int32).reshape((5,N))
arr_1d_large = np.arange(N, dtype=np.int32)
%timeit cy_append_2d(arr_2d_large, arr_1d_large)
# 35.7 µs ± 597 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit nb_append_2d(arr_2d_large, arr_1d_large)
# 44.8 µs ± 1.36 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Here Cython is slightly faster, but this can vary between machines and sizes; for our purposes we can consider them almost equally fast.

As @DavidW has pointed out, constructing Cython memoryviews from numpy arrays brings quite some overhead. Consider this dummy function:

%%cython
cpdef dummy(int[:, :] arr, int[:] value):
    pass

%timeit dummy(arr_2d, arr_1d)
# 3.24 µs ± 47.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

That means that out of the original 8µs, 3 are already spent before the first operation in the function even starts - here you can see the cost of creating the memoryviews.

Normally, one wouldn't care about this overhead - because if you call numpy functionality for such small data-chunks the performance will not be stellar anyway.


However, if you are really into this kind of micro-optimization, you can use Numpy's C-API directly, without going through Cython's ndarray helper. We cannot expect the result to be as fast as just copying 24 integers - creating a new buffer/numpy array is too costly for that - but our chances of beating 8µs are pretty high!

Here is a prototype which shows what could be possible:

%%cython
from libc.string cimport memcpy

# don't use Cythons wrapper, because it uses ndarray
# define only the needed stuff
cdef extern from "numpy/arrayobject.h":
    ctypedef int npy_intp            # it isn't actually int, but cython doesn't care anyway
    int _import_array() except -1
    char *PyArray_BYTES(object arr)
    npy_intp PyArray_DIM(object arr, int n)
    object PyArray_SimpleNew(int nd, npy_intp* dims, int typenum)
    cdef enum NPY_TYPES:
        NPY_INT32

# initialize Numpy's C-API when imported.
_import_array()  

def cy_fast_append_2d(upper, lower):
    # create resulting array:
    cdef npy_intp dims[2]
    dims[0] = PyArray_DIM(upper, 0)+1
    dims[1] = PyArray_DIM(upper, 1)
    cdef object res = PyArray_SimpleNew(2, &dims[0], NPY_INT32)
    # copy upper array, assume C-order/memory layout
    cdef char *dest = PyArray_BYTES(res)
    cdef char *source = PyArray_BYTES(upper)
    cdef int size = (dims[0]-1)*dims[1]*4 # int32=4 bytes
    memcpy(dest, source, size)
    # copy lower array, assume C-order/memory layout
    dest += size
    source = PyArray_BYTES(lower)
    size = dims[1]*4
    memcpy(dest, source, size)
    return res

The timings are now:

%timeit cy_fast_append_2d(arr_2d, arr_1d)
# 753 ns ± 3.13 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

which means Cython beats Numba by a factor of 4.

However, a lot of safety is lost - for example, it only works for C-order arrays, not for Fortran-order arrays (a minimal guard is sketched below). But my goal wasn't to give a watertight solution, only to investigate how fast direct use of Numpy's C-API might be - it's your decision whether this hacky route should be taken.
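As an illustration of the kind of checks one might add around the prototype, here is a small, hypothetical Python wrapper; the name safe_fast_append_2d, the module name in the import comment, and the use of np.ascontiguousarray are my additions, not part of the answer:

import numpy as np
# from append_2d_capi import cy_fast_append_2d  # hypothetical module name

def safe_fast_append_2d(upper, lower):
    # the C-API prototype assumes C-contiguous int32 buffers, so enforce that
    upper = np.ascontiguousarray(upper, dtype=np.int32)
    lower = np.ascontiguousarray(lower, dtype=np.int32)
    if upper.ndim != 2 or lower.ndim != 1 or lower.shape[0] != upper.shape[1]:
        raise ValueError("expected an (m, n) array and a length-n row")
    return cy_fast_append_2d(upper, lower)

The ascontiguousarray calls add a copy for non-contiguous or non-int32 inputs, which of course eats into the saved microseconds, so such a guard only makes sense if the inputs usually already have the right layout.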

ead