
When iterating over NumPy arrays, Numba seems dramatically faster than Cython.
What Cython optimizations am I possibly missing?

Here is a simple example:

Pure Python code:

import numpy as np

def f(arr):
    res = np.zeros(len(arr))

    for i in range(len(arr)):
        res[i] = arr[i] ** 2

    return res

arr=np.random.rand(10000)
%timeit f(arr)

Out: 4.81 ms ± 72.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Cython code (within Jupyter):

%load_ext cython
%%cython

import numpy as np
cimport numpy as np
cimport cython
from libc.math cimport pow

#@cython.boundscheck(False)
#@cython.wraparound(False)
cpdef f(double[:] arr):
    cdef np.ndarray[dtype=np.double_t, ndim=1] res
    res = np.zeros(len(arr), dtype=np.double)
    cdef double[:] res_view = res
    cdef int i

    for i in range(len(arr)):
        res_view[i] = pow(arr[i], 2)

    return res

arr=np.random.rand(10000)
%timeit f(arr)

Out: 445 µs ± 5.49 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Numba code:

import numpy as np
import numba as nb

@nb.jit(nb.float64[:](nb.float64[:]))
def f(arr):
    res = np.zeros(len(arr))

    for i in range(len(arr)):
        res[i] = arr[i] ** 2

    return res

arr=np.random.rand(10000)
%timeit f(arr)

Out: 9.59 µs ± 98.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In this example, Numba is almost 50 times faster than Cython.
Being a Cython beginner, I guess I am missing something.

Of course, in this simple case, using NumPy's vectorized square function would have been far more suitable:

%timeit np.square(arr)

Out: 5.75 µs ± 78.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

  • Why don't you do arr[i]**2 also in the Cython code? I think a possible reason is that `pow(arr[i],2)` will treat that `2` as a float and make the computation a lot more complicated – Antonio Ragagnin Nov 06 '18 at 11:31
  • Thanks, but I have also tried using arr[i]**2 instead of pow(arr[i],2); the performance of both solutions is nearly equal. In general, even with a simple iteration over a NumPy array without any mathematical transformation, the Numba-compiled function runs faster than the Cython one. – Greg A Nov 06 '18 at 13:09

1 Answer


As @Antonio has pointed out, using pow for a simple square is not very wise and leads to considerable overhead. Replacing pow(arr[i], 2) with arr[i]*arr[i] leads to a pretty large speed-up:

cython-pow-version        356 µs
numba-version              11 µs
cython-mult-version        14 µs
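
For reference, a minimal sketch of the cython-mult-version measured above - it is the question's Cython function with only the loop body changed (the name f_mult is mine):

cpdef f_mult(double[:] arr):
    cdef np.ndarray[dtype=np.double_t, ndim=1] res
    res = np.zeros(len(arr), dtype=np.double)
    cdef double[:] res_view = res
    cdef int i

    # plain multiplication instead of the libc pow call
    for i in range(len(arr)):
        res_view[i] = arr[i] * arr[i]

    return res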

The remaining difference is probably due to differences between the compilers and levels of optimization (LLVM vs MSVC in my case). You might want to use clang to match Numba's performance (see for example this SO answer).

To make the optimization easier for the compiler, you should declare the input as a contiguous array, i.e. double[::1] arr (see this question on why it is important for vectorization), use @cython.boundscheck(False) (use option -a to see that there is less yellow) and also add compiler flags (i.e. -O3, -march=native or similar, depending on your compiler, to enable vectorization; watch out for build flags used by default, which can inhibit some optimizations, for example -fwrapv). In the end you might want to write the workhorse loop in C, compile it with the right combination of flags/compiler, and use Cython to wrap it.
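
Putting those pieces together, a minimal sketch of such an optimized version (the name f_opt is mine; -c/--compile-args passes extra flags through the %%cython magic to the C compiler, and -O3/-march=native assume gcc or clang rather than MSVC):

%%cython -c=-O3 -c=-march=native
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef f_opt(double[::1] arr):  # ::1 declares the input C-contiguous
    cdef np.ndarray[dtype=np.double_t, ndim=1] res
    res = np.zeros(len(arr), dtype=np.double)
    cdef double[::1] res_view = res
    cdef Py_ssize_t i

    for i in range(arr.shape[0]):
        res_view[i] = arr[i] * arr[i]

    return res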

By the way, by typing the function's parameters as nb.float64[:](nb.float64[:]) you decrease Numba's performance - it is no longer allowed to assume that the input array is contiguous, which rules out vectorization. Let Numba detect the types (or declare them as contiguous, i.e. nb.float64[::1](nb.float64[::1])), and you will get better performance:

@nb.jit(nopython=True)
def nb_vec_f(arr):
    res = np.zeros(len(arr))

    for i in range(len(arr)):
        res[i] = arr[i] ** 2

    return res

This leads to the following improvement:

%timeit f(arr)  # numba version
# 11.4 µs ± 137 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit nb_vec_f(arr)
# 7.03 µs ± 48.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And as pointed out by @max9111, we don't have to initialize the result array with zeros - we can use np.empty(...) instead of np.zeros(...); this version even beats NumPy's np.square().
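
A sketch of that variant (the name nb_vec_f_empty is mine):

@nb.jit(nopython=True)
def nb_vec_f_empty(arr):
    # np.empty skips the zero-initialization; every element is overwritten in the loop
    res = np.empty(len(arr))

    for i in range(len(arr)):
        res[i] = arr[i] ** 2

    return res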

The performances of different approaches on my machine are:

numba+vectorization+empty     3 µs
np.square                     4 µs
numba+vectorization           7 µs
numba missed vectorization   11 µs
cython+mult                  14 µs
cython+pow                  356 µs
– ead
  • Thank you very much for your insight! With your optimizations my Cython function runs nearly as fast as Numba. – Greg A Nov 06 '18 at 13:31
  • It is not exactly related to the question, but one little thing is missing: the unnecessary zeroing of the allocated array at the beginning takes about 30+% of the total runtime and, at least in Numba, is not optimized away by the compiler. – max9111 Nov 06 '18 at 15:29
  • @ead This is only a question out of curiosity, but a while ago I had quite a similar problem with pow in Cython. If you don't hardcode the exponent within Numba and SVML is present, it calls SVML's pow function on 256-bit vectors, resulting in about 150 µs. Is there a simple alternative within Cython without using icc? – max9111 Nov 06 '18 at 16:16
  • @max9111, I must confess I have never tried it. I would probably rather write the code in C and wrap the functionality in Cython than try to access the "intrinsics" directly from Cython. – ead Nov 06 '18 at 16:35