
Summary:

You guys are too awesome... I got my real code working. I took JoshAdel's advice, namely the following:

1) Changed all ndarray to typed memoryviews
2) Unrolled all the numpy array calculations manually
3) Used a statically defined unsigned int for the index
4) Disabled boundscheck and wraparound

And also, thanks a lot to Veedrac's insight!

Original post:

I know that Python runs this kind of code really slowly:

import numpy as np

def func0():
    x = 0.
    for i in range(1000):
        x += 1.
    return

And if I change this to Cython, it can be much faster:

import numpy as np
cimport numpy as np

def func0():
    cdef double x = 0.
    for i in range(1000):
        x += 1.
    return

And here is the result:

# Python
10000 loops, best of 3: 58.9 µs per loop
# Cython
10000000 loops, best of 3: 66.8 ns per loop

However, now I have this kind of code, where the loop is over arrays rather than single numbers. (Yes... I am solving a PDE, so this happens.)

I know the following example is stupid, but it should convey the type of calculation involved:

Pure python:

def func1():
    array1 = np.random.rand(50000, 4)
    array2 = np.random.rand(50000)
    for i in range(1000):
        array1[:, 0] += array2
        array1[:, 1] += array2
        array1[:, 2] += array2
        array1[:, 3] += array2
    return

Cython:

def func1():
    cdef np.ndarray[np.double_t, ndim=2] array1 = np.random.rand(50000, 4)
    cdef np.ndarray[np.double_t, ndim=1] array2 = np.random.rand(50000)
    for i in range(1000):
        array1[:, 0] += array2
        array1[:, 1] += array2
        array1[:, 2] += array2
        array1[:, 3] += array2
    return

And there is almost no improvement at all. Meanwhile, I do know that Python handles these huge loops poorly because of its large overhead.

# Python
1 loops, best of 3: 299 ms per loop
# Cython
1 loops, best of 3: 300 ms per loop

Any suggestions on how I should improve this kind of code?

Yuxiang Wang

2 Answers


In these two other implementations, I've played around with

  • Using Cython compiler directives to remove some of the checks that indexing usually has to make
  • Using typed memoryviews so that I can specify the memory layout (they are also sometimes faster in general than the older buffer interface)
  • Unrolling the loops so that we don't use numpy's slice machinery.

Otherwise, you are just calling numpy from Cython, and numpy already implements these operations in fairly fast C code under the hood.

The methods:

import numpy as np
cimport numpy as np

cimport cython

def func1():
    cdef np.ndarray[np.double_t, ndim=2] array1 = np.random.rand(50000, 4)
    cdef np.ndarray[np.double_t, ndim=1] array2 = np.random.rand(50000)
    cdef unsigned int i
    for i in range(1000):
        array1[:, 0] += array2
        array1[:, 1] += array2
        array1[:, 2] += array2
        array1[:, 3] += array2
    return

@cython.boundscheck(False)
@cython.wraparound(False)
def func2():
    cdef np.ndarray[np.double_t, ndim=2] array1 = np.random.rand(50000, 4)
    cdef np.ndarray[np.double_t, ndim=1] array2 = np.random.rand(50000)
    cdef unsigned int i, k
    for i in range(1000):
        for k in xrange(50000):
            array1[k, 0] += array2[k]
            array1[k, 1] += array2[k]
            array1[k, 2] += array2[k]
            array1[k, 3] += array2[k]
    return


@cython.boundscheck(False)
@cython.wraparound(False)
def func3():
    cdef np.ndarray[np.double_t, ndim=2] array1 = np.random.rand(50000, 4)
    cdef np.ndarray[np.double_t, ndim=1] array2 = np.random.rand(50000)
    cdef np.double_t[::1] a2 = array2
    cdef np.double_t[:,::1] a1 = array1
    cdef unsigned int i, k

    for i in range(1000):
        for k in xrange(50000):
            a1[k, 0] += a2[k]
            a1[k, 1] += a2[k]
            a1[k, 2] += a2[k]
            a1[k, 3] += a2[k]
    return

The timings (on my machine; compilers and hardware will of course affect the numbers):

  • Pure numpy: 1 loops, best of 3: 419 ms per loop
  • Your original cython function with typing i: 1 loops, best of 3: 428 ms per loop
  • func2: 1 loops, best of 3: 336 ms per loop
  • func3: 1 loops, best of 3: 206 ms per loop
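
The unrolled kernel in func2/func3 can be sanity-checked in plain Python/NumPy before compiling anything. This is a small sketch (shrunken sizes, hypothetical names) confirming that the element-by-element updates produce the same result as the sliced version from the question:

```python
import numpy as np

rng = np.random.default_rng(42)
array1 = rng.random((100, 4))
array2 = rng.random(100)

# Sliced version, as in the question
expected = array1.copy()
for j in range(4):
    expected[:, j] += array2

# Unrolled version, mirroring the inner loop of func2/func3
for k in range(100):
    array1[k, 0] += array2[k]
    array1[k, 1] += array2[k]
    array1[k, 2] += array2[k]
    array1[k, 3] += array2[k]

assert np.allclose(array1, expected)
```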
JoshAdel
  • I am confident that if you swap the slow and fast index, you should gain another factor of 2. – Bort May 25 '14 at 18:12
  • Thank you Josh! I have updated my code (the real one, not this snippet) and it worked like magic. I will update my OP now just as a summary. – Yuxiang Wang May 25 '14 at 19:18
  • @Bort, what do you mean by "swap the slow and fast index" ? – avocado May 25 '16 at 12:41
  • @loganecolss Memory order is row major for C. So `array1[k, 0] += array2[k]` should be much slower than `array1[0, k] += array2[k]`. Of course array1 has to have the correct format. – Bort May 25 '16 at 13:22
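
Bort's memory-layout point can be illustrated in plain NumPy (a sketch with hypothetical names, not a benchmark): a C-contiguous copy of the transpose turns each strided column update into a walk over contiguous memory, while computing the same values:

```python
import numpy as np

rng = np.random.default_rng(0)
array1 = rng.random((50000, 4))   # C order: each length-4 row is contiguous
array2 = rng.random(50000)

# C-contiguous copy of the transpose: each of the 4 "columns" becomes
# one contiguous row of 50000 doubles
array1_t = np.ascontiguousarray(array1.T)

for j in range(4):
    array1[:, j] += array2        # strided access: step of 4 doubles
    array1_t[j, :] += array2      # contiguous access

assert np.allclose(array1, array1_t.T)  # both layouts give the same result
```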

Largely, you're just wrong. Python is great at these types of loops.

This is because the work inside each iteration is heavyweight. As JoshAdel shows, an effectively pure-C variant achieves only a 2 to 4 times speedup.

The best thing to do is to look for a few types of inefficiency:

  • Lots of slices (such as slicing many small rows)

    This is best solved by iterating manually and moving to Cython or Numba (or another JITing compiler)

  • Full-on Numpy array calculations that just aren't fast enough

    Numba and Parakeet can help, but you really want Theano or numexpr

  • More complicated, non-JITable, non-expression-able slow stuff

    Cython or C are your only choices, so move to them

See also the first part of this answer.
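
As a concrete illustration of the first bullet, using the question's own computation: the four per-column slice updates can be collapsed into a single broadcasted statement without leaving NumPy. This is a sketch (hypothetical names; the actual speedup will vary) showing the two forms are equivalent:

```python
import numpy as np

rng = np.random.default_rng(1)
array1 = rng.random((1000, 4))
array2 = rng.random(1000)

a = array1.copy()
for j in range(4):            # four passes, four temporary slice objects
    a[:, j] += array2

b = array1.copy()
b += array2[:, None]          # one broadcasted pass over the whole array

assert np.allclose(a, b)
```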

Veedrac
  • Thank you so much for your insight! They are very helpful. I have some questions though, 1) Slices are slow, but what about `my_array[i, j]`? Are they seen as 'slices' too? I noticed that they are also 'yellow' in the cython html output. 2) By full-on numpy array calculation, do you mean calling numpy functions or using numpy broadcasting? – Yuxiang Wang May 25 '14 at 19:17
  • `my_array[i, j]` will end up going through Python (yellow) *if*: **[1]** `my_array` is not typed as `numpy.ndarray[dtype, ndim=2]` or (better) `dtype[:, :]` **[2]** `i` or `j` are not typed with an integral type (like `int` or `long`) **[3]** bounds checking is on (although this rarely causes *much* slowdown and shows up in a light color). In fact, bounds checking stays in C until an exception needs to be thrown. – Veedrac May 25 '14 at 19:21
  • Hmmm... Weird. I'll look into it more to see what I have done wrong, but 1) my_array was typed as double [:, :] 2) i and j were typed as unsigned int 3) bounds checking was off, yet they are still yellow. I'll dig into it more.. Thanks again! – Yuxiang Wang May 25 '14 at 19:25
  • Sometimes Cython colours the wrong line, such as at the start and end of functions. You can click on the yellow line to get a hint about what non-C parts exist (they will be coloured) and really you just need to check the indentation level of that part. – Veedrac May 25 '14 at 19:58
  • Ahh - thanks for the tip! I discovered the answer - it is checking on division by zero. And I unfortunately do not know how to turn this check off - cdivision=False wouldn't help. – Yuxiang Wang May 25 '14 at 21:01
  • And also, thank you for introducing me Numba and Theano! I heard about them but never tried them out. It seems that Numba could be very cool, but I tried it on my code and it seems still buggy... (I actually just reported one). Given that Theano requires a significant amount of code re-writing, it seems that Cython is my best friend for now. Thanks again! – Yuxiang Wang May 25 '14 at 21:02
  • The stuff about `cdivision=False` warrants a new question if you don't get it worked out. The documentation says it should disable the check. – Veedrac May 25 '14 at 21:05
  • I made a stupid mistake - the default is False. I should use `cdivision=True`. Now done! – Yuxiang Wang May 25 '14 at 21:09