I'm trying to use Cython to speed up some parts of my Python script. One key section applies functions to a Pandas dataframe; since this is done many times, I wanted to write these functions with Cython for faster calculations. Functions are below, and are in the same Jupyter notebook cell:

%%cython
cimport numpy as np
import numpy as np

cdef double breadth_c_type(np.ndarray[np.float64_t, ndim=1] arr):
    """ Calculates range between the maximum and minimum values of a given list. """
    return (max(arr) - min(arr))

cdef double evenness_c_type(np.ndarray[np.float64_t, ndim=1] arr):
    """ Calculates the sample variance of differences between values in a sorted list. """
    cdef np.ndarray[double] sorted_arr
    cdef list desc_diff
    cdef double m
    cdef double var_res
    sorted_arr = sorted(arr)
    desc_diff = []
    for x in range(len(arr)-1):
        desc_diff.append(sorted_arr[x+1]-sorted_arr[x])
    # following used to avoid usage of numpy
    m = sum(desc_diff) / len(desc_diff)
    var_res = sum((xi - m)**2 for xi in desc_diff) / len(desc_diff)
    return var_res

The notebook cell runs successfully as written, so I thought that both functions compiled successfully. However, this code runs as expected:

%timeit rand_df.apply(breadth_c_type, raw=True)

whereas this code:

%timeit rand_df.apply(evenness_c_type, raw=True)

doesn't run, and raises "NameError: name 'evenness_c_type' is not defined". I get the same results without the %timeit magic, and the functions don't compile when using 'cpdef' or 'def' in place of 'cdef'. Since I tried to follow the same syntax for both functions, I don't know what's causing the error for evenness_c_type.

EDIT: Thanks to @DavidW, I figured out the problems with the evenness_c_type() function. The version below compiles and runs well, although not as fast as the plain Cython version.

cdef double evenness_c_type(np.ndarray[np.float64_t, ndim=1] arr):
    """ Calculates the population variance of differences between values in a sorted list. """
    cdef Py_ssize_t x  # typed loop index
    cdef np.ndarray[double] desc_diff = np.empty(len(arr) - 1, dtype=np.float64)
    arr.sort()  # sorts in place, so the caller's array is modified
    for x in range(len(arr) - 1):
        desc_diff[x] = arr[x + 1] - arr[x]
    return np.var(desc_diff)
Chris M-B
  • I'm actually slightly surprised either works. Being `cdef`, neither should be available from Python (I think... I'm never 100% sure what IPython does). – DavidW Mar 11 '20 at 17:24

1 Answer

In principle neither should work when called from Python, whether through timeit or DataFrame.apply. Both expect a Python callable, and a cdef function is not a Python object. However, under some circumstances Cython will automatically create a conversion from a cdef function to a Python object (effectively making it cpdef).

It doesn't compile with cpdef because of the generator expression ("closures inside cpdef functions not yet supported"):

var_res = sum((xi - m)**2 for xi in desc_diff) / len(desc_diff)

I get error messages saying this, although there's a compiler crash so they're not the most clear.

Replace that with a list comprehension and it'll be fine (although it doesn't look like it optimizes down particularly well):

var_res = sum([(xi - m)**2 for xi in desc_diff]) / len(desc_diff)

My suspicion is that the reason the auto-conversion wasn't generated for the cdef function was this generator expression.
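For checking what the compiled function returns, a plain-Python reference can help. The following is only a sketch: `evenness_reference` is a hypothetical name, not something from the question, and it uses the list-comprehension form suggested above. It also shows that dividing by len(gaps) gives the population variance (`statistics.pvariance`), not the sample variance (`statistics.variance`).

```python
from statistics import pvariance, variance


def evenness_reference(values):
    """Population variance of the gaps between consecutive sorted values.

    Hypothetical plain-Python reference, intended only for comparing
    against the output of the compiled Cython function.
    """
    ordered = sorted(values)  # sorted() returns a new list
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    mean = sum(gaps) / len(gaps)
    # List comprehension rather than a generator expression,
    # mirroring the replacement suggested in the answer.
    return sum([(g - mean) ** 2 for g in gaps]) / len(gaps)


sample = [3.0, 1.0, 10.0, 6.0]  # sorted gaps are 2.0, 3.0, 4.0
print(evenness_reference(sample))    # divides by n (population variance)
print(pvariance([2.0, 3.0, 4.0]))    # same quantity via the standard library
print(variance([2.0, 3.0, 4.0]))     # divides by n - 1 (sample variance)
```

Comparing the first two printed values against the Cython result on the same input is a quick way to confirm that a rewrite did not change the calculation.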

It doesn't compile as a def function because you specify a return type (double), which isn't allowed on a def function.

Consider whether you really need to make it cdef/cpdef. Most of the time there's little benefit.

DavidW
  • I made the list comprehension change you suggested, and now get a "TypeError: Cannot convert list to numpy.ndarray" error message when trying to run pandas `DataFrame.apply(evenness_c_type)` (no problems using `breadth_c_type` instead). As for whether I need to make it `cdef/cpdef`, these functions are called often in my script, so I want to make them as efficient as possible. I was following a tutorial [https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html], and using plain Cython gave a performance increase over Python (~50% faster per loop). – Chris M-B Mar 12 '20 at 20:24
  • OK - but if they're being called through `DataFrame.apply` then they're being called through a Python wrapper anyway, so you won't get any benefit from `cdef`. If you're calling them from elsewhere the `cdef/cpdef` _may_ be worthwhile. Maybe. I don't really know where the TypeError is coming from (and I don't think I have the information to know). – DavidW Mar 12 '20 at 20:34
  • A few suggestions: you know the size of `desc_diff`. Allocate it in one go as a Numpy array rather than building/appending a list. Ditch both uses of `sum`. Either use the Numpy array `arr.sum()` or unroll the loop and write it out manually. Regular for-loops are quick in Cython while playing with lists typically isn't. – DavidW Mar 12 '20 at 20:37
  • Looking at it closer, I think the error you're getting is coming from the `sorted` line, because it returns a list rather than an array. Numpy arrays should have their own sort functions - use those instead. – DavidW Mar 12 '20 at 21:51
  • Thanks for the help, I didn't know that numpy functions were supported in Cython (I only started using Cython this past week). It's not the most efficient, but it works now. – Chris M-B Mar 14 '20 at 01:46
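The "write it out manually" suggestion in the comments above can be sketched as two explicit loops with no intermediate list and no call to `sum`. This is plain Python so that it runs anywhere; `evenness_loops` is a hypothetical name, and in a Cython version the locals would presumably be declared with C types and the input sorted with the array's own sort method rather than `sorted()`.

```python
def evenness_loops(values):
    """Population variance of the gaps between sorted values, using explicit loops.

    Hypothetical sketch of the loop structure only; it is not the
    asker's final function and carries no Cython type declarations.
    """
    ordered = sorted(values)  # new list, so the input is left unmodified
    n = len(ordered) - 1      # number of gaps between consecutive values
    if n < 1:
        raise ValueError("need at least two values")

    # First pass: mean of the gaps, accumulated without storing them.
    total = 0.0
    for i in range(n):
        total += ordered[i + 1] - ordered[i]
    mean = total / n

    # Second pass: mean squared deviation of each gap from that mean.
    sq_dev = 0.0
    for i in range(n):
        d = (ordered[i + 1] - ordered[i]) - mean
        sq_dev += d * d
    return sq_dev / n


print(evenness_loops([7.0, 1.0, 4.0, 2.0]))
```

Simple indexed loops like these are the shape that Cython can usually translate to C once the variables are typed, which is the point made in the comment about for-loops versus lists.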