According to https://murillogroupmsu.com/julia-set-speed-comparison/, Numba used on pure Python code is faster than Numba used on Python code that uses NumPy. Is that generally true, and why?
In https://stackoverflow.com/a/25952400/4533188 it is explained why Numba on pure Python is faster than NumPy-Python: Numba sees more code and has more ways to optimize it, whereas NumPy only sees a small portion at a time.
Numba simply replaces NumPy functions with its own implementations. They can be faster or slower, and the results can also differ slightly. The problem is the mechanism by which this replacement happens: quite often unnecessary temporary arrays and separate loops are involved, which could be fused.
Loop fusing and removing temporary arrays is not an easy task. The behavior also differs depending on whether you compile for the parallel target, which is a lot better at loop fusing, or for a single-threaded target.
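Roughly speaking (this is a sketch of the evaluation order, not NumPy's actual internals), a NumPy expression like the one in the example below is evaluated one operation at a time, and every intermediate result is materialized as a full-size temporary array:
import numpy as np

def numpy_expanded(a,b,c,d):
    #what np.sum(a+b*2.+c*3.+d*4.) roughly expands to:
    #each elementwise operation allocates a new full-size temporary array
    TMP_1=b*2.
    TMP_2=a+TMP_1
    TMP_3=c*3.
    TMP_4=TMP_2+TMP_3
    TMP_5=d*4.
    TMP_6=TMP_4+TMP_5
    return np.sum(TMP_6) #pairwise summation over the last temporary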
[Edit]
The optimizations described in Section 1.10.4 (Diagnostics) of the Numba documentation (like loop fusing), which are done by the parallel accelerator, can also be enabled in single-threaded mode by setting parallel=True and nb.parfor.sequential_parfor_lowering = True.
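To check what the parallel accelerator actually fused, the diagnostics described in that documentation section can be printed after the first call has triggered compilation (a minimal sketch; the exact report format depends on the Numba version):
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def fused(a,b):
    return np.sum(a+b*2.)

a=np.random.rand(1_000)
b=np.random.rand(1_000)
fused(a,b)                          #compile by calling once
fused.parallel_diagnostics(level=1) #prints which parfors were fused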
Example
#only for single-threaded numpy test
import os
os.environ["OMP_NUM_THREADS"] = "1"
import numba as nb
import numpy as np
a=np.random.rand(100_000_000)
b=np.random.rand(100_000_000)
c=np.random.rand(100_000_000)
d=np.random.rand(100_000_000)
#Numpy version
#every expression is evaluated on its own
#the summation algorithm (Pairwise summation) isn't equivalent to the algorithm I used below
def Test_np(a,b,c,d):
    return np.sum(a+b*2.+c*3.+d*4.)
#The same code, but for Numba (results and performance differ)
@nb.njit(fastmath=False,parallel=True)
def Test_np_nb(a,b,c,d):
    return np.sum(a+b*2.+c*3.+d*4.)
#the summation isn't fused; this approximates the behaviour of Test_np_nb
#for the single-threaded target
@nb.njit(fastmath=False,parallel=True)
def Test_np_nb_eq(a,b,c,d):
    TMP=np.empty(a.shape[0])
    for i in nb.prange(a.shape[0]):
        TMP[i]=a[i]+b[i]*2.+c[i]*3.+d[i]*4.
    res=0.
    for i in nb.prange(a.shape[0]):
        res+=TMP[i]
    return res
#The usual way someone would implement this in Numba
@nb.njit(fastmath=False,parallel=True)
def Test_nb(a,b,c,d):
    res=0.
    for i in nb.prange(a.shape[0]):
        res+=a[i]+b[i]*2.+c[i]*3.+d[i]*4.
    return res
Timings
#single-threaded
%timeit res_1=Test_nb(a,b,c,d)
178 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=Test_np(a,b,c,d)
2.72 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=Test_np_nb(a,b,c,d)
562 ms ± 5.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_4=Test_np_nb_eq(a,b,c,d)
612 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#single-threaded
#parallel=True
#nb.parfor.sequential_parfor_lowering = True
%timeit res_1=Test_nb(a,b,c,d)
188 ms ± 5.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=Test_np_nb(a,b,c,d)
184 ms ± 817 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_4=Test_np_nb_eq(a,b,c,d)
185 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#multi-threaded
%timeit res_1=Test_nb(a,b,c,d)
105 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=Test_np(a,b,c,d)
1.78 s ± 75.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=Test_np_nb(a,b,c,d)
102 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_4=Test_np_nb_eq(a,b,c,d)
102 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Results
#single-threaded
res_1=Test_nb(a,b,c,d)
499977967.27572954
res_2=Test_np(a,b,c,d)
499977967.2756622
res_3=Test_np_nb(a,b,c,d)
499977967.2756614
res_4=Test_np_nb_eq(a,b,c,d)
499977967.2756614
#multi-threaded
res_1=Test_nb(a,b,c,d)
499977967.27572465
res_2=Test_np(a,b,c,d)
499977967.2756622
res_3=Test_np_nb(a,b,c,d)
499977967.27572465
res_4=Test_np_nb_eq(a,b,c,d)
499977967.27572465
Conclusion
Which approach is best depends on the use case. Some algorithms can easily be written in a few lines in NumPy; other algorithms are hard or impossible to implement in a vectorized fashion.
I also used a summation example here on purpose. Summing everything in one simple loop is easy to code and a lot faster, but if I wanted the most precise result I would definitely use a more sophisticated summation algorithm, such as the pairwise summation already implemented in NumPy. Of course you can do the same in Numba, but that is more work.
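For example, a compensated (Kahan) summation could be written in Numba like this (a sketch for the single-threaded case, not what NumPy itself uses; note that fastmath=True would allow the compiler to optimize the compensation away):
import numba as nb

@nb.njit(fastmath=False)
def kahan_sum(x):
    #compensated (Kahan) summation: c accumulates the rounding error
    s=0.
    c=0.
    for i in range(x.shape[0]):
        y=x[i]-c
        t=s+y
        c=(t-s)-y
        s=t
    return s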