I am trying to speed up my code using Cython. After translating a piece of code from Python to Cython I am not seeing any speed-up, and I think the origin of the problem is the poor performance I get when using numpy arrays in Cython.

I have come up with a very simple program to show this:

############### test.pyx #################
import numpy as np
cimport numpy as np
cimport cython

def func1(long N):

    cdef double sum1,sum2,sum3  # plain C scalars
    cdef long i

    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0

    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i

    return sum1,sum2,sum3

def func2(long N):

    cdef np.ndarray[np.float64_t,ndim=1] sum_arr  # numpy buffer syntax
    cdef long i

    sum_arr = np.zeros(3,dtype=np.float64)

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr

def func3(long N):

    cdef double sum_arr[3]  # fixed-size C array
    cdef long i

    sum_arr[0] = 0.0
    sum_arr[1] = 0.0
    sum_arr[2] = 0.0

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr
##########################################

################## test.py ###############
import time
import test as test

N = 1000000000

for i in xrange(10):

    start = time.time()
    sum1,sum2,sum3 = test.func1(N)
    print 'Time taken = %.3f'%(time.time()-start)

print '\n'
for i in xrange(10):
    start = time.time()
    sum_arr = test.func2(N)
    print 'Time taken = %.3f'%(time.time()-start)

print '\n'
for i in xrange(10):
    start = time.time()
    sum_arr = test.func3(N)
    print 'Time taken = %.3f'%(time.time()-start)
############################################
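
The extension is built the usual way; for reference, here is a minimal setup.py (the standard Cython + numpy recipe, nothing specific to this problem):

############### setup.py #################
from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    ext_modules=cythonize("test.pyx"),
    # numpy headers are needed because of cimport numpy
    include_dirs=[numpy.get_include()],
)
##########################################

compiled with python setup.py build_ext --inplace.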

And from `python test.py` I get:

Time taken = 1.445
Time taken = 1.433
Time taken = 1.434
Time taken = 1.428
Time taken = 1.449
Time taken = 1.425
Time taken = 1.421
Time taken = 1.451
Time taken = 1.483
Time taken = 1.418

Time taken = 2.623
Time taken = 2.603
Time taken = 2.977
Time taken = 3.237
Time taken = 2.748
Time taken = 2.798
Time taken = 2.811
Time taken = 2.783
Time taken = 2.585
Time taken = 2.595

Time taken = 1.503
Time taken = 1.529
Time taken = 1.509
Time taken = 1.543
Time taken = 1.427
Time taken = 1.425
Time taken = 1.423
Time taken = 1.415
Time taken = 1.414
Time taken = 1.418

My question is: why is func2 almost 2x slower than func1 and func3?

Is there a way to improve this?

Thanks!

######## UPDATE

My real problem is as follows. I am calling a function that accepts a 3D array (say P[i,j,k]). The function loops through each element and computes several quantities: a quantity that depends on the value of the array at that position (say A=f(P[i,j,k])), and other quantities that depend only on the position itself (B=g(i,j,k)). Schematically things look like this:

for i in xrange(N):
    corr1 = h(i,val)

    for j in xrange(N):
        corr2 = h(j,val)

        for k in xrange(N):
            corr3 = h(k,val)

            A = f(P[i,j,k])
            B = g(i,j,k)
            Arr[B] += A*corr1*corr2*corr3

where val is a property of the 3D array, represented by a number; this number can be different for different fields.

Since I have to do this operation over many 3D arrays, I thought it would be better to create a new routine that accepts many different input 3D arrays, with the number of arrays unknown a priori. The idea is that since B will be exactly the same for all arrays, I can avoid computing it for each array and compute it only once. The problem is that corr1, corr2, corr3 above then become arrays:

If the number of 3D arrays is num_3D_arrays, I am doing something like:

for i in xrange(N):
    for p in xrange(num_3D_arrays):
        corr1[p] = h(i,val[p])

    for j in xrange(N):
        for p in xrange(num_3D_arrays):
            corr2[p] = h(j,val[p])

        for k in xrange(N):
            for p in xrange(num_3D_arrays):
                corr3[p] = h(k,val[p])

            B = g(i,j,k)
            for p in xrange(num_3D_arrays):
                A[p] = f(P[i,j,k])
                Arr[p,B] += A[p]*corr1[p]*corr2[p]*corr3[p]

So changing the variables corr1, corr2, corr3 and A from scalars to arrays is killing the performance I expected to gain by avoiding the big loop.
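
To be concrete, the buffers corr1, corr2, corr3 and A are declared the same way as sum_arr in func2 above (a simplified sketch of my declarations; the real names differ):

cdef np.ndarray[np.float64_t,ndim=1] corr1 = np.zeros(num_3D_arrays, dtype=np.float64)
cdef np.ndarray[np.float64_t,ndim=1] corr2 = np.zeros(num_3D_arrays, dtype=np.float64)
cdef np.ndarray[np.float64_t,ndim=1] corr3 = np.zeros(num_3D_arrays, dtype=np.float64)
cdef np.ndarray[np.float64_t,ndim=1] A = np.zeros(num_3D_arrays, dtype=np.float64)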

Francisco
  • the code `for i in range(N): sum_arr[0] += i; sum_arr[1] += 2.0*i; sum_arr[2] += 3.0*i` ignores all that numpy is good at. Numpy is not fast because you can access indices fast, but because it can do whole-array numeric operations quickly. But not like that. I'd recommend reading up on numpy – Piotr Kamoda Feb 20 '17 at 16:29
  • And I suppose it would be hard to make it faster, because provided you are stubborn enough to use `numpy`, you'd have to create a numpy array in that loop and do np.sum(), and creating a numpy array is probably the slowest thing possible in that snippet. I also recommend checking each line separately instead of this simple timeit. **[some reading on profiling](http://stackoverflow.com/questions/582336/how-can-you-profile-a-script)** – Piotr Kamoda Feb 20 '17 at 16:40
  • OK thanks! The problem in my case is that I cannot define individual variables as in func1; I need to define an array of a size that I do not know a priori. Is there a different way to do this than using a numpy array? – Francisco Feb 20 '17 at 16:44
  • But is this the target algorithm? That you add constants times i in a loop to all variables? You can do it however you wish, but regular Python arrays are quite quick at indexing. I don't know if you can get any quicker without some drawbacks. – Piotr Kamoda Feb 20 '17 at 16:47
  • Yes. In my problem I need to loop over a number of different fields whose number I do not know a priori. – Francisco Feb 20 '17 at 16:59
  • And do what? Add a constant? How do you know the constants for an unknown number of fields? It's too hard to tell without really seeing the target algorithm – Piotr Kamoda Feb 20 '17 at 17:10

3 Answers


There are a couple of things you can do to speed up array indexing in Cython: use a typed memoryview instead of the np.ndarray buffer syntax, and disable bounds checking and negative-index wraparound with the @cython.boundscheck(False) and @cython.wraparound(False) decorators.

So for your function:

@cython.boundscheck(False)  # skip bounds checking on indexing
@cython.wraparound(False)   # skip negative-index handling
def func2(long N):

    cdef np.float64_t[::1] sum_arr  # typed memoryview, C-contiguous
    cdef long i

    sum_arr = np.zeros(3,dtype=np.float64)

    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

    return sum_arr

For the original code, Cython produced the following C code for the line sum_arr[0] += i:

__pyx_t_12 = 0;
__pyx_t_6 = -1;
if (__pyx_t_12 < 0) {
  __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
  if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
} else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
if (unlikely(__pyx_t_6 != -1)) {
  __Pyx_RaiseBufferIndexError(__pyx_t_6);
  {__pyx_filename = __pyx_f[0]; __pyx_lineno = 13; __pyx_clineno = __LINE__; goto __pyx_L1_error;}
}
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;

With the improvements above:

__pyx_t_8 = 0;
*((double *) ( /* dim=0 */ ((char *) (((double *) __pyx_v_sum_arr.data) + __pyx_t_8)) )) += __pyx_v_i;
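
One detail worth noting: with the memoryview declaration, return sum_arr hands the caller a typed memoryview object rather than an ndarray. If the caller expects an array, convert it on the way out; np.asarray wraps the same buffer without copying:

    return np.asarray(sum_arr)  # zero-copy view of the memoryview's buffer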
user7138814
  • why is func2 almost 2x slower than func1?

    It's because indexing causes an indirection, so you double the number of elementary operations. Calculate the sums as in func1, then build the array afterwards with sum_arr = np.array([sum1, sum2, sum3]).

  • How to speed up Python code?

    1. NumPy is the first good idea; it reaches nearly C speed with little effort.

    2. Numba can fill the gap with no effort too, and is very simple.

    3. Cython for critical cases.

Here is an illustration of that:

# python way
def func1(N):
    sum1 = 0.0
    sum2 = 0.0
    sum3 = 0.0

    for i in range(N):
        sum1 += i
        sum2 += 2.0*i
        sum3 += 3.0*i

    return sum1,sum2,sum3

# numpy way
from numpy import arange

def func2(N):
    aran = arange(float(N))
    sum1=aran.sum()
    sum2=(2.0*aran).sum()
    sum3=(3.0*aran).sum()
    return sum1,sum2,sum3

# numba way
import numba

func3 = numba.njit(func1)

"""
In [609]: %timeit func1(10**6)
1 loop, best of 3: 710 ms per loop

In [610]: %timeit func2(1e6)
100 loops, best of 3: 22.2 ms per loop

In [611]: %timeit func3(10e6)
100 loops, best of 3: 2.87 ms per loop
"""
B. M.
  • Thanks for the reply! My real problem is more complicated than func1 and func2, which I just used to illustrate the problem. What I don't fully understand is why the indexing is so slow with numpy. If I define a routine: – Francisco Feb 20 '17 at 17:41
  • def func3(long N): cdef double sum_arr[3]; cdef long i; sum_arr[0] = 0.0; sum_arr[1] = 0.0; sum_arr[2] = 0.0; for i in range(N): sum_arr[0] += i; sum_arr[1] += 2.0*i; sum_arr[2] += 3.0*i; return sum_arr – Francisco Feb 20 '17 at 17:43
  • sum1 += 23 costs 1 ns, whereas sum_arr[i] += 23 is really sum_arr[ref+i] += 23 and costs 2 ns. It's not a numpy problem, it's an assembly-level problem. – B. M. Feb 20 '17 at 17:45
  • numpy arrays are C arrays, so the problem is not a numpy one. Edit your post to explain clearly what the unknown size is; perhaps we can find a workaround. – B. M. Feb 20 '17 at 17:52
  • I have updated the post to try to explain my algorithm – Francisco Feb 20 '17 at 18:48
  • Because in that case I will be calling the function g(i,j,k) N×N×N×num_3D_arrays times, while in the current implementation I just call it N×N×N times. – Francisco Feb 20 '17 at 21:08

Look at the HTML produced by `cython -a ...pyx`.

For func1, the sum1 += i line expands to:

+15:         sum1 += i
    __pyx_v_sum1 = (__pyx_v_sum1 + __pyx_v_i);

For func3, with a C array:

+45:         sum_arr[0] += i
    __pyx_t_3 = 0;
    (__pyx_v_sum_arr[__pyx_t_3]) = ((__pyx_v_sum_arr[__pyx_t_3]) + __pyx_v_i);

Slightly more complicated, but straightforward C.

But for func2:

+29:         sum_arr[0] += i
    __pyx_t_12 = 0;
    __pyx_t_6 = -1;
    if (__pyx_t_12 < 0) {
      __pyx_t_12 += __pyx_pybuffernd_sum_arr.diminfo[0].shape;
      if (unlikely(__pyx_t_12 < 0)) __pyx_t_6 = 0;
    } else if (unlikely(__pyx_t_12 >= __pyx_pybuffernd_sum_arr.diminfo[0].shape)) __pyx_t_6 = 0;
    if (unlikely(__pyx_t_6 != -1)) {
      __Pyx_RaiseBufferIndexError(__pyx_t_6);
      __PYX_ERR(0, 29, __pyx_L1_error)
    }
    *__Pyx_BufPtrStrided1d(__pyx_t_5numpy_float64_t *, __pyx_pybuffernd_sum_arr.rcbuffer->pybuffer.buf, __pyx_t_12, __pyx_pybuffernd_sum_arr.diminfo[0].strides) += __pyx_v_i;

Much more complicated, with calls into Cython's buffer machinery (e.g. __Pyx_BufPtrStrided1d). Even initializing the array is complicated:

+26:     sum_arr = np.zeros(3,dtype=np.float64)
  __pyx_t_1 = __Pyx_GetModuleGlobalName(__pyx_n_s_np); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 26, __pyx_L1_error)
  __Pyx_GOTREF(__pyx_t_1);
   ....

I expect that moving the sum_arr creation into the calling Python code, and passing it as an argument to func2, would save some time.
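
A sketch of that idea, building on the memoryview version from the other answer (the function name here is made up):

@cython.boundscheck(False)
@cython.wraparound(False)
def func2_prealloc(long N, np.float64_t[::1] sum_arr):
    # the caller owns sum_arr; this function only fills it
    cdef long i
    sum_arr[0] = 0.0
    sum_arr[1] = 0.0
    sum_arr[2] = 0.0
    for i in range(N):
        sum_arr[0] += i
        sum_arr[1] += 2.0*i
        sum_arr[2] += 3.0*i

On the Python side the buffer is allocated once and reused across calls:

buf = np.zeros(3, dtype=np.float64)
test.func2_prealloc(N, buf)  # results end up in buf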

Have you read this guide for using memoryviews:

http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html

You'll get the best Cython performance if you focus on writing the low-level operations so they translate into simple C. In

    for k in xrange(N):
        corr3 = h(k,val)

        A = f(P[i,j,k])
        B = g(i,j,k)
        Arr[B] += A*corr1*corr2*corr3

It's not the loops over i,j,k that will slow you down. It's evaluating h, f, and g each time, as well as the Arr[B] +=.... Those functions should be tightly coded Cython, not general Python functions. Look at the compiled simplicity of the sum3d function in the memoryview guide.
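
For instance (a sketch with made-up bodies standing in for the real h and f): declared as cdef inline with C argument and return types, they compile to plain C function calls with no Python overhead:

cdef inline double h(long i, double val):
    # hypothetical correction factor
    return 1.0 + val*i

cdef inline double f(double p):
    # hypothetical per-element quantity
    return p*p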

hpaulj