
I am currently trying to optimize code that I originally wrote in pure Python. This code uses NumPy very heavily, as I am working with NumPy arrays. Below you can see the simplest of my classes, which I converted to Cython. It only does a multiplication of two NumPy arrays, here:

bendingForces = self.matrixPrefactor * membraneHeight

My question is if, and how, I can optimize this, because the C code that "cython -a" generates contains a lot of NumPy calls, which does not look very efficient.

import numpy as np
cimport numpy as np
ctypedef np.float64_t dtype_t
ctypedef np.complex128_t cplxtype_t
ctypedef Py_ssize_t index_t

cdef class bendingForcesClass( object ):
    cdef dtype_t bendingRigidity
    cdef np.ndarray matrixPrefactor
    cdef np.ndarray bendingForces

    def __init__( self, dtype_t bendingRigidity, np.ndarray[dtype_t, ndim=2] waveNumbersNorm ):
        self.bendingRigidity = bendingRigidity
        self.matrixPrefactor = -self.bendingRigidity * waveNumbersNorm**2

    cpdef np.ndarray calculate( self, np.ndarray membraneHeight ) :
        cdef np.ndarray bendingForces
        bendingForces = self.matrixPrefactor * membraneHeight
        return bendingForces

The idea I had was to use two for loops and iterate over the entries of the arrays, hoping that the compiler could then optimize this with SIMD operations. I tried it, and it compiled, but it gave strange results and took forever. Here is the code of the substitute function:

cpdef np.ndarray calculate( self, np.ndarray membraneHeight ) :

    cdef index_t index1, index2 # corresponds to: cdef Py_ssize_t index1, index2
    for index1 in range( self.matrixSize ):
        for index2 in range( self.matrixSize ):
            self.bendingForces[ index1, index2 ] = self.matrixPrefactor.data[ index1, index2 ] * membraneHeight.data[ index1, index2 ]
    return self.bendingForces

This code, however, as I said, is really slow and does not work as expected. So what am I doing wrong? What would be the best way to optimize this and remove the overhead of the NumPy calls?
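For reference, here is a minimal pure-NumPy sketch (with made-up 4x4 data; the real arrays are 64x64 to 1024x1024) of the elementwise product that calculate is supposed to compute:

```python
import numpy as np

# Stand-ins for the real fields: the cached prefactor -k*|q|^2 and the
# current membrane height (here just placeholder values).
matrixPrefactor = -1.0 * np.arange(16.0).reshape(4, 4) ** 2
membraneHeight = np.ones((4, 4))

# Elementwise (Hadamard) product; NumPy performs the loop in native code.
bendingForces = matrixPrefactor * membraneHeight
```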

Peter Mortensen
packoman
    You should probably try to avoid re-writing everything in cython. I would check what parts of your code are the bottlenecks and optimize just those. I'm not sure that you will get huge improvements over numpy's element-wise array multiplication. – JoshAdel Mar 16 '11 at 21:09
  • Can you give us an example of input size and output of this function? I know how to optimize it but need to know how you are using it and how fast it is. – fabrizioM Mar 16 '11 at 21:16
    If you are always chasing performance, consider stepping out of Python. Cython is no magic. – BatchyX Mar 16 '11 at 21:17
    What makes you think you can do better than numpy? – David Heffernan Mar 16 '11 at 22:17
    @David Heffernan: It is not hard to beat numpy using Cython in certain cases e.g., see my answer http://stackoverflow.com/questions/4962606/fast-tensor-rotation-with-numpy – jfs Mar 16 '11 at 23:07
  • Thanks for the comments. @JoshAdel: I know that I shouldn't do that, and I am not trying to. My idea was to develop in pure Python and then use Cython to optimize only the functions that I actually call in the calculation loop (not the setting up of the objects). @fabrizioM: The input and output arrays are of sizes 64x64 to 1024x1024 and are complex, but (here) with the imaginary part zero. @David Heffernan: I'm not saying I can. My question is how I can reduce the function-call overhead of the NumPy multiplication. Like I said, the C code from "cython -a" doesn't look too efficient. – packoman Mar 17 '11 at 09:27
  • @packoman: My experience is that the C code Cython generates looks like a tangled mess (even for simple things), but it still runs fast when the original Cython was written correctly/in the most efficient way. My use of Cython, though, has been for single methods, not full classes, and they contain zero NumPy methods within the .pyx file. See https://github.com/synapticarbors/pyqcprot/blob/master/pyqcprot.pyx for an example. – JoshAdel Mar 17 '11 at 12:27

2 Answers


For simple elementwise multiplications like this one, NumPy is already doing only the looping and multiplying natively, so it is hard to beat from Cython. Cython shines in situations where you are replacing loops written in Python with loops in Cython. One of the reasons your code is slower than NumPy is that every time you do an index lookup in your array,

self.bendingForces[ index1, index2 ] = self.matrixPrefactor.data[ index1, index2 ] * membraneHeight.data[ index1, index2 ]

Cython does extra work, such as bounds checking (verifying that the index is valid). If you cast your indices to unsigned ints (or type them as Py_ssize_t), you can also use the decorator @cython.boundscheck(False) before the function to turn those checks off.

See this tutorial for more details on speeding up Cython code.
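An untested sketch of what that could look like, assuming float64, C-contiguous arrays and that self.bendingForces is allocated in __init__ (buffer types cannot be used on cdef-class attributes, hence the typed local views):

```cython
cimport cython
cimport numpy as np
import numpy as np

ctypedef np.float64_t dtype_t

cdef class bendingForcesClass:
    cdef dtype_t bendingRigidity
    cdef np.ndarray matrixPrefactor
    cdef np.ndarray bendingForces

    @cython.boundscheck(False)
    @cython.wraparound(False)
    cpdef np.ndarray calculate(self, np.ndarray[dtype_t, ndim=2] membraneHeight):
        # Bind the attributes to buffer-typed locals so that indexing
        # compiles to plain C array accesses instead of Python calls.
        cdef np.ndarray[dtype_t, ndim=2] prefactor = self.matrixPrefactor
        cdef np.ndarray[dtype_t, ndim=2] forces = self.bendingForces
        cdef Py_ssize_t i, j
        for i in range(prefactor.shape[0]):
            for j in range(prefactor.shape[1]):
                # With boundscheck/wraparound off, no per-element checks remain.
                forces[i, j] = prefactor[i, j] * membraneHeight[i, j]
        return forces
```

Even so, for a single elementwise multiply this is unlikely to beat NumPy by much, since NumPy's loop already runs in C.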

highBandWidth

You could probably speed this up by using

for index1 from 0 <= index1 < max1:

instead of using range, which I am not sure is typed.

Have you checked this out and this one?
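For illustration, a fragment of how that legacy "from" loop syntax might look inside calculate (assuming a typed matrixSize attribute exists; in modern Cython, range() over a typed index compiles to the same C loop):

```cython
# Typed indices with the from-style loop bounds.
cdef Py_ssize_t index1, index2, max1 = self.matrixSize
for index1 from 0 <= index1 < max1:
    for index2 from 0 <= index2 < max1:
        self.bendingForces[index1, index2] = (
            self.matrixPrefactor[index1, index2]
            * membraneHeight[index1, index2] )
```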

Joost Rekveld