I recently played with Cython and Numba to accelerate small pieces of a python that does numerical simulation. At first, developing with numba seems easier. Yet, I found difficult to understand when numba will provide a better performance and when it will not.
One example of unexpected performance drop is when I use the function np.zeros()
to allocate a big array in a compiled function. For example, consider the three function definitions:
import numpy as np
from numba import jit
def pure_python(n):
mat = np.zeros((n,n), dtype=np.double)
# do something
return mat.reshape((n**2))
@jit(nopython=True)
def pure_numba(n):
mat = np.zeros((n,n), dtype=np.double)
# do something
return mat.reshape((n**2))
def mixed_numba1(n):
return mixed_numba2(np.zeros((n,n)))
@jit(nopython=True)
def mixed_numba2(array):
n = len(array)
# do something
return array.reshape((n,n))
# To compile
pure_numba(10)
mixed_numba1(10)
Since the #do something
is empty, I do not expect the pure_numba
function to be faster. Yet, I was not expecting such a performance drop:
n=10000
%timeit x = pure_python(n)
%timeit x = pure_numba(n)
%timeit x = mixed_numba1(n)
I obtain (python 3.7.7, numba 0.48.0 on a mac)
4.96 µs ± 65.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
344 ms ± 7.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
3.8 µs ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Here, the numba code is much slower when I use the function np.zeros()
inside the compiled function. It works normally when the np.zeros()
is outside the function.
Am I doing something wrong here or should I always allocate big arrays like these outside functions that are compiled by numba?
Update
This seems related to a lazy initialization of the matrices by np.zeros((n,n))
when n
is large enough (see Performance of zeros function in Numpy ).
for n in [1000, 2000, 5000]:
print('n=',n)
%timeit x = pure_python(n)
%timeit x = pure_numba(n)
%timeit x = mixed_numba1(n)
gives me:
n = 1000
468 µs ± 15.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
296 µs ± 6.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
300 µs ± 2.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
n = 2000
4.79 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.45 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.54 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
n = 5000
270 µs ± 4.66 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
104 ms ± 599 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
119 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)