
I am trying out Cython for the first time, and tried to convert a function from pure NumPy to Cython.

Here are the two functions:

from __future__ import division
import numpy as np
cimport numpy as np

DTYPEf = np.float64
ctypedef np.float64_t DTYPEf_t

DTYPEi = np.int64
ctypedef  np.int64_t DTYPEi_t

DTYPEu = np.uint8
ctypedef np.uint8_t DTYPEu_t

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def twodcitera(np.ndarray[DTYPEf_t, ndim=3] data, int res, int indexl, int indexu, float radius1, float radius2, output, float height1, float height2 ):
  '''
  Function to return correlation for fixed radius using Cython
  '''
  cdef float sum_mask = 0
  cdef int i,j,k
  cdef int a, b, c
  cdef np.ndarray[DTYPEi_t, ndim=3] x
  cdef np.ndarray[DTYPEi_t, ndim=3] y
  cdef np.ndarray[DTYPEi_t, ndim=3] z
  cdef np.ndarray[DTYPEu_t, ndim=3, cast=True] R

  a,b,c = res//2,res//2,res//2
  x,y,z = np.ogrid[-a:a,-b:b,-c:c]

  for i in xrange(indexl,indexu):
    for j in xrange(1):
      for k in xrange(1):
        R = np.roll(np.roll(np.roll(np.logical_and(np.logical_or(np.logical_and(z>height1,z<=height2), np.logical_and(z<-height1,z>=-height2)), np.logical_and(x**2 + y**2<= radius2**2, x**2 + y**2 > radius1**2)), (i-a), axis =0), (j-a), axis =1), (k-a), axis =2)
        sum_mask += (data[i][j][k] * np.average(data[R]))

  output.put(sum_mask)

And for numpy implementation:

def no_twodcitera(data, res, indexl, indexu, radius1, radius2, output, height1, height2 ):
  '''
  Function to return correlation for fixed radius
  '''
  a,b,c = res/2,res/2,res/2
  x,y,z = np.ogrid[-a:a,-b:b,-c:c]
  sum_mask = 0
  for i in xrange(indexl,indexu):
    for j in xrange(1):
      for k in xrange(1):
        R = np.roll(np.roll(np.roll(np.logical_and(np.logical_or(np.logical_and(z>height1,z<=height2), np.logical_and(z<-height1,z>=-height2)), np.logical_and(x**2 + y**2<= radius2**2, x**2 + y**2 > radius1**2)), (i-a), axis =0), (j-a), axis =1), (k-a), axis =2)
        sum_mask += (data[i][j][k] * np.average(data[R]))

  output.put(sum_mask)

Both functions actually take the same time to complete:

%timeit -n200 -r10 twodcitera(dd, tes_res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.57 ms per loop

%timeit -n200 -r10 no_twodcitera(dd, tes_res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.57 ms per loop

I was wondering what I was doing wrong, or what I haven't understood correctly, when trying to implement Cython. The inputs are:

dd  = np.random.randn(64,64,64) 
res = 64
r   = np.arange(0,21,2)
in1 = 0
in2 = 1
l   = 5
k   = 7
output = mp.Queue()
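
For context, output is consumed from worker processes (I have parallelised the code over several cores). A minimal sketch of launching one such worker and reading its result back (not my exact driver, just an illustration) would be:

import multiprocessing as mp

# sketch: run the pure-NumPy version in a single worker process
# and read the accumulated sum_mask back from the shared Queue
p = mp.Process(target=no_twodcitera,
               args=(dd, res, in1, in2, r[k], r[k+1], output, r[l], r[l+1]))
p.start()
result = output.get()   # blocks until the worker calls output.put(sum_mask)
p.join()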

Thank you if you could point out my misunderstanding here.

MisterJ
  • xrange for j and k was kept at 1 just for testing purposes; in the end it will be j in xrange(res) and k in xrange(res) – MisterJ Jun 09 '15 at 17:25
  • Did you try running the code with cython -a? http://docs.cython.org/src/quickstart/cythonize.html#determining-where-to-add-types – Padraic Cunningham Jun 09 '15 at 17:48
  • Also, what are in1, in2, etc.? – Padraic Cunningham Jun 09 '15 at 17:55
  • So r = np.arange(0,21,2). in1 and in2 are just two indices to traverse the array, e.g. in1 = 0, in2 = 1. radius1 and radius2 are the radii of the circular shell, and similarly height1 and height2 are the lengths for the cylindrical shell. I have parallelised the code to run on multiple cores, hence the use of output, which is just output = mp.Queue(). Hope this helps to understand the code better. – MisterJ Jun 09 '15 at 18:04
  • `np.roll`, `np.logical_and`, etc.: all of those are slow Python functions that won't see much speed-up if you convert them to Cython. If you really want a speed-up, express it as a loop over every index of the R array to do the same thing. Also, `data[i][j][k]` is terrible for Cython speed-wise; use `data[i,j,k]` instead. – rth Jun 09 '15 at 19:41
  • Thanks @rth I will check it out :) – MisterJ Jun 10 '15 at 11:27

1 Answer


Without knowing your input and output, the following compiled for me, following the Cython guide. If you explain how to create test input, I could maybe provide more help.

EDIT: My first thought was that there was maybe something wrong with the Cython compilation, but I could not find anything really helpful. Therefore this answer does not really help with the speed problem. Anyway, I leave it here for those interested in testing and understanding it.

Put the code into test.pyx

cimport cython
import numpy as np
cimport numpy as np

DTYPEf = np.float64
ctypedef np.float64_t DTYPEf_t

DTYPEi = np.int64
ctypedef  np.int64_t DTYPEi_t

DTYPEu = np.uint8
ctypedef np.uint8_t DTYPEu_t


@cython.boundscheck(False)
@cython.wraparound(False)
def twodcitera(np.ndarray[DTYPEf_t, ndim=3] data, int res, int indexl, int indexu, float radius1, float radius2, output, float height1, float height2 ):
    '''
    Function to return correlation for fixed radius using Cython
    '''
    cdef float sum_mask = 0
    cdef int i,j,k
    cdef int a, b, c
    cdef np.ndarray[DTYPEi_t, ndim=3] x
    cdef np.ndarray[DTYPEi_t, ndim=3] y
    cdef np.ndarray[DTYPEi_t, ndim=3] z
    cdef np.ndarray[DTYPEu_t, ndim=3, cast=True] R
    a,b,c = res//2,res//2,res//2
    x,y,z = np.ogrid[-a:a,-b:b,-c:c]
    for i in xrange(indexl,indexu):
        for j in xrange(1):
            for k in xrange(1):
                R = np.roll(np.roll(np.roll(np.logical_and(np.logical_or(np.logical_and(z>height1,z<=height2), np.logical_and(z<-height1,z>=-height2)), np.logical_and(x**2 + y**2<= radius2**2, x**2 + y**2 > radius1**2)), (i-a), axis =0), (j-a), axis =1), (k-a), axis =2)
                sum_mask += (data[i][j][k] * np.average(data[R]))
    output.put(sum_mask)

Create a build file setup.py and put in:

from distutils.core import setup
from Cython.Build import cythonize

setup(
    name = "testapp",
    ext_modules = cythonize('test.pyx'),  # accepts a glob pattern
    )
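
Since test.pyx does cimport numpy, the build may abort with an error about numpy/arrayobject.h not being found. In that case the NumPy headers have to be passed explicitly, for example like this (an alternative setup.py, not required on every installation):

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

setup(
    name = "testapp",
    ext_modules = cythonize([Extension("test", ["test.pyx"],
                                       include_dirs=[numpy.get_include()])]),
    )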

Go to the shell and compile it:

$python setup.py build_ext --inplace

Go to IPython and try to import it:

from test import *

This worked for me.

A speed test showed:

In [28]: %timeit -n200 -r10 no_twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.29 ms per loop

In [29]: %timeit -n200 -r10 test.twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.31 ms per loop

So the results are the same and there is not much difference. I furthermore ran a cProfile study to see if anything shows up in the runtime of the call stack. One has to confess that cProfile gets hard to interpret at millisecond speeds, but let's give it a try.

In [34]: cProfile.run("""no_twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])""")
         82 function calls in 0.004 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.004    0.004 <ipython-input-27-663e142d15fb>:1(no_twodcitera)
        1    0.000    0.000    0.004    0.004 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 _methods.py:43(_count_reduce_items)
        1    0.000    0.000    0.000    0.000 _methods.py:53(_mean)
        1    0.000    0.000    0.000    0.000 function_base.py:436(average)
        1    0.000    0.000    0.000    0.000 index_tricks.py:151(__getitem__)
        3    0.000    0.000    0.002    0.001 numeric.py:1279(roll)
        1    0.000    0.000    0.000    0.000 numeric.py:394(asarray)
        4    0.000    0.000    0.000    0.000 numeric.py:464(asanyarray)
        1    0.000    0.000    0.000    0.000 queues.py:99(put)
        1    0.000    0.000    0.000    0.000 threading.py:299(_is_owned)
        1    0.000    0.000    0.000    0.000 threading.py:372(notify)
        1    0.000    0.000    0.000    0.000 threading.py:63(_note)
        1    0.000    0.000    0.000    0.000 {hasattr}
       18    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {issubclass}
        5    0.000    0.000    0.000    0.000 {len}
        3    0.000    0.000    0.000    0.000 {math.ceil}
        1    0.000    0.000    0.000    0.000 {method 'acquire' of '_multiprocessing.SemLock' objects}
        2    0.000    0.000    0.000    0.000 {method 'acquire' of 'thread.lock' objects}
        1    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
        3    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'mean' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.000    0.000 {method 'release' of 'thread.lock' objects}
        3    0.002    0.001    0.002    0.001 {method 'take' of 'numpy.ndarray' objects}
        9    0.000    0.000    0.000    0.000 {numpy.core.multiarray.arange}
        5    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
        3    0.000    0.000    0.000    0.000 {numpy.core.multiarray.concatenate}
        4    0.000    0.000    0.000    0.000 {range}
        1    0.000    0.000    0.000    0.000 {zip}



In [35]: cProfile.run("""test.twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])""")
         82 function calls in 0.003 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.003    0.003 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 _methods.py:43(_count_reduce_items)
        1    0.000    0.000    0.000    0.000 _methods.py:53(_mean)
        1    0.000    0.000    0.000    0.000 function_base.py:436(average)
        1    0.000    0.000    0.000    0.000 index_tricks.py:151(__getitem__)
        3    0.000    0.000    0.001    0.000 numeric.py:1279(roll)
        1    0.000    0.000    0.000    0.000 numeric.py:394(asarray)
        4    0.000    0.000    0.000    0.000 numeric.py:464(asanyarray)
        1    0.000    0.000    0.000    0.000 queues.py:99(put)
        1    0.000    0.000    0.000    0.000 threading.py:299(_is_owned)
        1    0.000    0.000    0.000    0.000 threading.py:372(notify)
        1    0.000    0.000    0.000    0.000 threading.py:63(_note)
        1    0.000    0.000    0.000    0.000 {hasattr}
       18    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {issubclass}
        5    0.000    0.000    0.000    0.000 {len}
        3    0.000    0.000    0.000    0.000 {math.ceil}
        1    0.000    0.000    0.000    0.000 {method 'acquire' of '_multiprocessing.SemLock' objects}
        2    0.000    0.000    0.000    0.000 {method 'acquire' of 'thread.lock' objects}
        1    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
        3    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'mean' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.000    0.000 {method 'release' of 'thread.lock' objects}
        3    0.001    0.000    0.001    0.000 {method 'take' of 'numpy.ndarray' objects}
        9    0.000    0.000    0.000    0.000 {numpy.core.multiarray.arange}
        5    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
        3    0.000    0.000    0.000    0.000 {numpy.core.multiarray.concatenate}
        4    0.000    0.000    0.000    0.000 {range}
        1    0.001    0.001    0.003    0.003 {test.twodcitera}
        1    0.000    0.000    0.000    0.000 {zip}

Sadly, nothing pops up. I would conclude that the reason is that NumPy is already well implemented, so not much time is lost in the nested loops. Furthermore, Cython mainly benefits from static typing; since we are mostly calling into NumPy here, that does not bring much of a benefit.
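
As rth suggests in the comments, a real speed-up would require replacing the np.roll and boolean-indexing calls inside the loop with plain C loops over the typed arrays, using data[i,j,k] indexing. A rough, untested sketch of that direction (the name twodcitera_loops is mine; it also builds the mask only once, since the mask does not depend on i, j, k):

cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def twodcitera_loops(np.ndarray[np.float64_t, ndim=3] data, int res,
                     int indexl, int indexu, float radius1, float radius2,
                     output, float height1, float height2):
    cdef int a = res // 2
    cdef int i, j, k, p, q, s, pp, qq, ss
    cdef double sum_mask = 0.0, acc, cnt
    cdef np.ndarray[np.uint8_t, ndim=3, cast=True] mask
    # the shell mask only depends on the radii and heights, so build it once
    x, y, z = np.ogrid[-a:a, -a:a, -a:a]
    mask = np.logical_and(
        np.logical_or(np.logical_and(z > height1, z <= height2),
                      np.logical_and(z < -height1, z >= -height2)),
        np.logical_and(x**2 + y**2 <= radius2**2, x**2 + y**2 > radius1**2))
    for i in range(indexl, indexu):
        for j in range(1):          # xrange(res) in the real run
            for k in range(1):
                acc = 0.0
                cnt = 0.0
                for p in range(res):
                    # shifting the lookup index replaces the three np.roll calls
                    # (cdivision is left at its default, so % matches Python's
                    #  behaviour for negative values)
                    pp = (p - (i - a)) % res
                    for q in range(res):
                        qq = (q - (j - a)) % res
                        for s in range(res):
                            ss = (s - (k - a)) % res
                            if mask[pp, qq, ss]:
                                acc += data[p, q, s]
                                cnt += 1.0
                if cnt > 0:         # guard against an empty shell
                    sum_mask += data[i, j, k] * (acc / cnt)
    output.put(sum_mask)

Whether running all the index arithmetic and the triple loop in C actually pays off still has to be measured; the point is only to show the kind of rewrite rth's comment is aiming at.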

PlagTag
  • As @moarningsun said, I was able to compile; I was just looking into how I could improve the performance. I have edited my question to clearly state the inputs I use. Thanks. – MisterJ Jun 10 '15 at 11:22
  • You are right, guys, sorry for the confusion. I first thought the reason might be the compilation, but I had no testing data. I also had a look at the runtime profile, but there was not much there. Maybe one could try scipy.weave or a Fortran implementation to speed things up. – PlagTag Jun 11 '15 at 08:38