
I need to perform a lot of work using 2D numpy arrays of various sizes and I would like to offload these calculations onto cython. The idea is that my 2D numpy arrays would be passed from python to cython where it would be converted into c-array or memory view and used in a cascade of other c-level functions to do the calculations.

After some profiling I ruled out using numpy arrays in cython due to some serious python overhead. Using memory views was MUCH faster and quite easy to use, but I suspect I can squeeze even more speedup from using c-arrays.

Here is my question though - how can I declare a 2D c-array in cython without predefining its dimensions with set values? For example, I can create a c-array from numpy this way:

narr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]], dtype=np.dtype("i"))

cdef int c_arr[3][4]
for i in range(3):
    for j in range(4):
        c_arr[i][j] = narr[i][j]

and then pass it to a function:

cdef void somefunction(int c_Arr[3][4]):
    ...

But this implies a fixed array size, which in my case would be useless. So I tried something like this:

narr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]], dtype=np.dtype("i"))

cdef int a = np.shape(narr)[0]
cdef int b = np.shape(narr)[1]

cdef int c_arr[a][b]                # INCORRECT - EXAMPLE ONLY

for i in range(a):
    for j in range(b):
        c_arr[i][j] = narr[i][j]

with the intention to pass it to a function like this:

cdef void somefunction(int a, int b, int c_Arr[a][b]):
    ...

But it doesn't work, and compilation fails with the error "Not allowed in a constant expression". I suspect I need to do it with malloc/free somehow? I had a look at this problem (How to declare 2D list in Cython), but it does not provide the answer to my problem.
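For reference, the malloc route stores the 2D data as one contiguous block and indexes it manually. Here is a pure-Python sketch of that row-major layout (the names `a`, `b`, `flat` are illustrative only; in Cython the block would come from malloc rather than a list):

```python
# Pure-Python sketch of the row-major ("flattened") layout a malloc-based
# 2D array would use; this is a model, not actual Cython code.
a, b = 3, 4
narr = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]

# One contiguous block of a*b ints, like malloc(a * b * sizeof(int)):
flat = [0] * (a * b)
for i in range(a):
    for j in range(b):
        flat[i * b + j] = narr[i][j]   # element (i, j) lives at offset i*b + j

# Reading it back recovers the original 2D structure:
rebuilt = [[flat[i * b + j] for j in range(b)] for i in range(a)]
```

The key point is the `i*b + j` offset: rows of width `b` laid end to end, which is exactly the layout a C `int[a][b]` would have in memory.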

UPDATE:

It turns out that memory views can be as fast as c-arrays, provided that bounds (IndexError) checking is switched off for the memory views, which can be done with the Cython compiler directive:

# cython: boundscheck=False

Thanks @Veedrac for the tip!

    Also note that whenever you have stupid numbers like "2700x faster" it's because one is getting optimized out completely. The benefits will be much smaller if you actually use the output. – Veedrac Sep 19 '14 at 19:49
  • Ah yes, good point! Also using floats/doubles instead of integers makes an impact too... – phil_thy Sep 22 '14 at 11:20
  • Aside from "# cython: boundscheck=False", you can also seek additional speed-up with "# cython: wraparound=False" – Dane Lee Aug 25 '18 at 08:16

2 Answers


You just need to stop doing bounds checking:

with cython.boundscheck(False):
    thesum += x_view[i,j]

that brings the speed basically up to par.


If you really want a C array from it, try:

import numpy
from numpy import int32
from numpy cimport int32_t

numpy_array = numpy.array([[]], dtype=int32)

cdef:
    int32_t[:, :] cython_view = numpy_array
    int32_t *c_integers_array = &cython_view[0, 0]
    int32_t[4] *c_2d_array = <int32_t[4] *>c_integers_array

First you get a Numpy array. You use that to get a memory view. Then you get a pointer to its data, which you cast to pointers of the desired stride.
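The cast is easier to picture on the flat data itself: `<int32_t[4] *>` just reinterprets one contiguous block as rows of a fixed width. A pure-Python model of that stride (all names here are hypothetical; this is not the Cython API):

```python
# Illustrative sketch of what the <int32_t[4] *> cast does: reinterpret one
# contiguous block of memory as rows of a fixed width. Plain Python model.
width = 4
block = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # contiguous row-major data

# The cast tells C to step through the block `width` ints at a time;
# slicing the flat list into width-sized chunks models the same stride:
rows = [block[k:k + width] for k in range(0, len(block), width)]
```

No data is copied in the real cast, of course; only the stride used for `[i][j]` indexing changes.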

  • Brilliant, thanks @Veedrac, that brought me back up to the same speed as c_arrays! – phil_thy Sep 19 '14 at 09:55
  • However, I still wonder about how can you (if you can) generate the 2D c-arrays by dynamically specifying its size? I got as far as declaring "cdef int *carr = malloc(a*b*sizeof(int*))" and then using two for-loops to assign individual array values with "carr[(i*a)+j] = narr[i][j]", but it's still not a proper 2D array, nor I can actually pass it to a function as a 2D array... Anyone? – phil_thy Sep 19 '14 at 10:01
  • Does this answer your question? – Veedrac Sep 19 '14 at 20:43
  • Ha! Yes it does, thank you again! I can see now that the memory views approach is MUCH easier than messing around with c_arrays, and what's more important - just as fast too! Sorted. – phil_thy Sep 22 '14 at 10:45

So after invaluable help from @Veedrac (many thanks!) I finally came up with a script that demonstrates the use of both memory views and c-arrays to speed up calculations in Cython. They both come down to similar speeds, so I personally think using memory views is MUCH easier.

Here is an example Cython script that "accepts" a numpy array, converts it to a memory view or c-array, and then performs a simple array summation via c-level functions:

# cython: boundscheck=False

cimport cython
import numpy as np
cimport numpy as np

from numpy import int32
from numpy cimport int32_t


#Generate numpy array:
narr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]], dtype=np.dtype("i"))

cdef int a = np.shape(narr)[0]
cdef int b = np.shape(narr)[1]
cdef int i, j

testsum = np.sum(narr)
print "Test summation: np.sum(narr) =", testsum

#Generate the memory view:
cdef int [:,:] x_view = narr

#Generate the 2D c-array and its pointer:
cdef:
    int32_t[:, :] cython_view = narr
    int32_t *c_integers_array = &cython_view[0, 0]
    int32_t[4] *c_arr = <int32_t[4] *>c_integers_array


def test1():

    speed_test_mview(x_view)  

def test2():

    speed_test_carray(&c_arr[0][0], a, b)


cdef int speed_test_mview(int[:,:] x_view):

    cdef int n, i, j, thesum

    # Define the view:
    for n in range(10000):
        thesum = 0
        for i in range(a):
            for j in range(b):
                thesum += x_view[i, j]        


cdef int speed_test_carray(int32_t *c_Arr, int a, int b):

    cdef int n, i, j, thesum
    for n in range(10000):
        thesum = 0
        for i in range(a):
            for j in range(b):
                thesum += c_Arr[(i*b)+j]
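
As a sanity check, the flattened indexing used in speed_test_carray can be modeled in plain Python: `c_Arr[(i*b)+j]` walks the contiguous block row by row and should reproduce the 2D total the script prints (136). The sketch below is a model only, not Cython:

```python
# Pure-Python model of speed_test_carray's indexing: verify that
# flat[(i*b)+j] over a row-major copy visits every element exactly once
# and reproduces the 2D total.
narr = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
a, b = len(narr), len(narr[0])
flat = [v for row in narr for v in row]  # row-major, as the pointer cast assumes

thesum = 0
for i in range(a):
    for j in range(b):
        thesum += flat[i * b + j]
```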

Then timing tests in an IPython shell reveal similar speeds:

import testlib as t
Test summation: np.sum(narr) = 136

%timeit t.test1()
10000000 loops, best of 3: 46.3 ns per loop

%timeit t.test2()
10000000 loops, best of 3: 46 ns per loop

Oh and for comparison - using numpy arrays in this example took 125 ms (not shown).

    You're converting `c_arr` to a 1D array by using `&c_arr[0][0]`, but you might as well just have used `c_integers_array`, which is exactly the same thing. The advantage of `c_arr` is that you can pass it to a function that accepts `int[4] *array` or even `int array[4][4]`. – Veedrac Sep 22 '14 at 11:38
  • Also, if you do proper timings that prevent the non-bounds-checked versions from being optimized out completely, you notice two things. The first is that accessing the globals `a` and `b` is much slower than doing `cdef int a, b; a = x_view.shape[0]; b = x_view.shape[1]` and looping over those. The second is that bounds checking has about a 60% overhead (eg from 1s → 1.6s). – Veedrac Sep 22 '14 at 11:48