
Using numba results in much faster programs than using pure Python:

It seems established by now that numba on pure Python is (most of the time) even faster than NumPy-Python, e.g. https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/.

According to https://murillogroupmsu.com/julia-set-speed-comparison/, numba applied to pure Python code is faster than numba applied to Python code that uses NumPy. Is that generally true, and why?

In https://stackoverflow.com/a/25952400/4533188 it is explained why numba on pure Python is faster than NumPy-Python: numba sees more code and has more ways to optimize it than NumPy, which only sees a small portion.

Does this answer my question? Do I prevent numba from fully optimizing my code when using NumPy, because numba is forced to use the NumPy routines instead of finding an even more optimal way? I had hoped that numba would realise this and not use the NumPy routines when it is non-beneficial - then it would use the NumPy routines only if they are an improvement (after all, NumPy is pretty well tested). After all, "Support for NumPy arrays is a key focus of Numba development and is currently undergoing extensive refactorization and improvement."

MSeifert
Make42

2 Answers


Let's get a few things straight before I answer the specific questions:

  • I'll only consider nopython code for this answer; object-mode code is often slower than pure Python/NumPy equivalents.
  • I'll ignore the numba GPU capabilities for this answer - it's difficult to compare code running on the GPU with code running on the CPU.
  • When you call a NumPy function in a numba function you're not really calling a NumPy function. Everything that numba supports is re-implemented in numba. That applies to NumPy functions but also to Python data types in numba! So the implementation details between Python/NumPy inside a numba function and outside might be different because they are totally different functions/types.
  • Numba generates code that is compiled with LLVM. Numba is not magic, it's just a wrapper for an optimizing compiler with some optimizations built into numba!
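To make the third point concrete: when numba compiles a call to np.sum, it lowers it to its own compiled implementation, which conceptually looks like a plain accumulation loop. The pure-Python sketch below illustrates the idea only - it is not numba's actual source, and NumPy's own np.sum additionally uses pairwise summation, so the two implementations can give slightly different results:

```python
import numpy as np

def my_sum(arr):
    # Simplified sketch of what a compiler-friendly re-implementation
    # of np.sum looks like: a plain accumulation loop. Numba compiles
    # its own version of np.sum; this is NOT numba's actual code.
    acc = 0.0
    for i in range(arr.shape[0]):
        acc += arr[i]
    return acc

a = np.arange(10, dtype=np.float64)
print(my_sum(a), float(np.sum(a)))  # both 45.0, but the algorithms differ
```

The point is that the two `sum`s are different functions that merely agree on the result (here exactly, in general only approximately).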

It seems established by now, that numba on pure python is even (most of the time) faster than numpy-python

No. Numba is often slower than NumPy. It depends on what operation you want to do and how you do it. Numba is reliably faster if you handle very small arrays, or if the only alternative would be to manually iterate over the array.
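One hedged illustration of the "manual iteration" case: an exponential moving average has a loop-carried dependency (each output depends on the previous one), so there is no single NumPy function for it and an explicit loop is the natural implementation - exactly the shape of code that benefits from an @numba.njit decorator (numba assumed installed). Plain Python/NumPy is used here so the sketch runs without numba:

```python
import numpy as np

def ema(x, alpha):
    # Exponential moving average: out[i] depends on out[i-1], so the
    # computation cannot be expressed as a single vectorized NumPy
    # call. Decorating exactly this function with @numba.njit is the
    # kind of case where numba reliably beats pure Python.
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, x.shape[0]):
        out[i] = alpha * x[i] + (1.0 - alpha) * out[i - 1]
    return out

x = np.array([1.0, 2.0, 3.0])
print(ema(x, 0.5))  # [1.   1.5  2.25]
```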

numba used on pure python code is faster than used on python code that uses numpy. Is that generally true and why?

That depends on the code - there are probably more cases where NumPy beats numba. However, the trick is to apply numba where there's no corresponding NumPy function, where you need to chain lots of NumPy functions, or where the available NumPy functions aren't ideal. Knowing when (and how) a numba implementation might be faster takes experience - and inside a numba function it's best not to call NumPy functions, because you inherit all the drawbacks of the NumPy function. It's easy to write a very slow numba function by accident.

Do I prevent numba from fully optimizing my code when using NumPy, because numba is forced to use the NumPy routines instead of finding an even more optimal way?

Yep.

I had hoped that numba would realise this and not use the NumPy routines when it is non-beneficial.

No, that's not how numba works at the moment. Numba just creates code for LLVM to compile. Maybe that's a feature numba will have in the future (who knows). Currently numba performs best if you write the loops and operations yourself and avoid calling NumPy functions inside numba functions.
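A sketch of that advice, using a hypothetical weighted sum: the loop-style variant below is the form you would decorate with @nb.njit; it is written in plain Python here so it runs without numba installed:

```python
import numpy as np

def weighted_sum_chained(a, b):
    # NumPy-style: a * 2., b * 3. and their sum each materialize a
    # temporary array before np.sum runs.
    return np.sum(a * 2. + b * 3.)

def weighted_sum_loop(a, b):
    # Loop-style: one pass, no temporary arrays - the form that
    # compiles well under @nb.njit (numba itself is not used here).
    res = 0.0
    for i in range(a.shape[0]):
        res += a[i] * 2. + b[i] * 3.
    return res

a = np.arange(4, dtype=np.float64)
b = np.arange(4, dtype=np.float64)
print(weighted_sum_chained(a, b), weighted_sum_loop(a, b))  # 30.0 30.0
```

In pure Python the loop version is of course much slower; the point is only what shape of code to hand to numba.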

There are a few libraries that use expression-trees and might optimize non-beneficial NumPy function calls - but these typically don't allow fast manual iteration. For example numexpr can optimize multiple chained NumPy function calls. At the moment it's either fast manual iteration (cython/numba) or optimizing chained NumPy calls using expression trees (numexpr). Maybe it's not even possible to do both inside one library - I don't know.
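What the non-beneficial chaining looks like concretely: NumPy evaluates np.sum(a + b * 2.) one operation at a time, materializing a full temporary array per step - the separate passes that numexpr (or a hand-written loop) can fuse. A small illustration of the evaluation order, not of numexpr's internals:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# How NumPy evaluates np.sum(a + b * 2.): each step walks the whole
# array and materializes a temporary before the next step starts.
tmp1 = b * 2.       # temporary array [8., 10., 12.]
tmp2 = a + tmp1     # temporary array [9., 12., 15.]
res = np.sum(tmp2)  # 36.0

# An expression-tree library like numexpr (or a hand-written loop)
# can fuse these three passes into a single pass with no temporaries.
print(res)  # 36.0
```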


Numba and Cython are great when it comes to small arrays and fast manual iteration over arrays. NumPy/SciPy are great because they come with a whole lot of sophisticated functions to do various tasks out of the box. Numexpr is great for chaining multiple NumPy function calls. In some cases Python is faster than any of these tools.

In my experience you can get the best out of the different tools if you compose them. Don't limit yourself to just one tool.

MSeifert
  • Do you have tips (or possibly reading material) that would help with getting a better understanding when to use numpy / numba / numexpr? If you think it is worth asking a new question for that, I can also post a new question. – Make42 Mar 23 '21 at 15:11
  • I haven't worked with numba in quite a while now. So I don't think I have up-to-date information or references. But a question asking for reading material is also off-topic on StackOverflow ... not sure if I can help you there :( – MSeifert Mar 23 '21 at 18:26

According to https://murillogroupmsu.com/julia-set-speed-comparison/ numba used on pure python code is faster than used on python code that uses numpy. Is that generally true and why?

In https://stackoverflow.com/a/25952400/4533188 it is explained why numba on pure python is faster than numpy-python: numba sees more code and has more ways to optimize the code than numpy which only sees a small portion.

Numba just replaces NumPy functions with its own implementations. They can be faster or slower, and the results can also differ. The problem is the mechanism by which this replacement happens: quite often there are unnecessary temporary arrays and loops involved, which could be fused.

Loop fusing and removing temporary arrays is not an easy task. The behavior also differs depending on whether you compile for the parallel target, which is a lot better at loop fusing, or for a single-threaded target.

[Edit] The optimizations done in the parallel accelerator (like loop fusing; see Section 1.10.4. Diagnostics of the numba documentation) can also be enabled in single-threaded mode by setting parallel=True and nb.parfor.sequential_parfor_lowering = True.

Example

#only for single-threaded numpy test
import os
os.environ["OMP_NUM_THREADS"] = "1"

import numba as nb
import numpy as np

a=np.random.rand(100_000_000)
b=np.random.rand(100_000_000)
c=np.random.rand(100_000_000)
d=np.random.rand(100_000_000)

#Numpy version
#every expression is evaluated on its own 
#the summation algorithm (Pairwise summation) isn't equivalent to the algorithm I used below
def Test_np(a,b,c,d):
    return np.sum(a+b*2.+c*3.+d*4.)

#The same code, but for Numba (results and performance differ)
@nb.njit(fastmath=False,parallel=True)
def Test_np_nb(a,b,c,d):
    return np.sum(a+b*2.+c*3.+d*4.)

#the summation isn't fused; this approximates the behaviour of
#Test_np_nb for the single-threaded target
@nb.njit(fastmath=False,parallel=True)
def Test_np_nb_eq(a,b,c,d):
    TMP=np.empty(a.shape[0])
    for i in nb.prange(a.shape[0]):
        TMP[i]=a[i]+b[i]*2.+c[i]*3.+d[i]*4.

    res=0.
    for i in nb.prange(a.shape[0]):
        res+=TMP[i]

    return res

#The usual way someone would implement this in Numba
@nb.njit(fastmath=False,parallel=True)
def Test_nb(a,b,c,d):
    res=0.
    for i in nb.prange(a.shape[0]):
        res+=a[i]+b[i]*2.+c[i]*3.+d[i]*4.
    return res

Timings

#single-threaded
%timeit res_1=Test_nb(a,b,c,d)
178 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=Test_np(a,b,c,d)
2.72 s ± 118 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=Test_np_nb(a,b,c,d)
562 ms ± 5.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_4=Test_np_nb_eq(a,b,c,d)
612 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#single-threaded
#parallel=True
#nb.parfor.sequential_parfor_lowering = True
%timeit res_1=Test_nb(a,b,c,d)
188 ms ± 5.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_3=Test_np_nb(a,b,c,d)
184 ms ± 817 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit res_4=Test_np_nb_eq(a,b,c,d)
185 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#multi-threaded
%timeit res_1=Test_nb(a,b,c,d)
105 ms ± 3.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_2=Test_np(a,b,c,d)
1.78 s ± 75.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_3=Test_np_nb(a,b,c,d)
102 ms ± 686 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res_4=Test_np_nb_eq(a,b,c,d)
102 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Results

#single-threaded
res_1=Test_nb(a,b,c,d)
499977967.27572954
res_2=Test_np(a,b,c,d)
499977967.2756622
res_3=Test_np_nb(a,b,c,d)
499977967.2756614
res_4=Test_np_nb_eq(a,b,c,d)
499977967.2756614

#multi-threaded
res_1=Test_nb(a,b,c,d)
499977967.27572465
res_2=Test_np(a,b,c,d)
499977967.2756622
res_3=Test_np_nb(a,b,c,d)
499977967.27572465
res_4=Test_np_nb_eq(a,b,c,d)
499977967.27572465

Conclusion

Which tool is best to use depends on the use case. Some algorithms can be easily written in a few lines of NumPy; other algorithms are hard or impossible to implement in a vectorized fashion.

I also used a summation example on purpose here. Doing it all at once in a single fused loop is easy to code and a lot faster, but if I want the most precise result I would definitely use the more sophisticated (pairwise) summation algorithm which is already implemented in NumPy. Of course you can do the same in Numba, but that would be more work.
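That precision difference is easy to demonstrate without numba, by comparing NumPy's pairwise sum against the naive left-to-right accumulation that a hand-written reduction loop uses - exaggerated here with float32 (a sketch; the exact error depends on the data):

```python
import numpy as np

a = np.full(1_000_000, 0.1, dtype=np.float32)  # exact sum ~ 100000

def naive_sum(arr):
    # Naive left-to-right accumulation in float32 - the algorithm a
    # simple hand-written reduction loop uses. Once the accumulator is
    # large, each small addend loses low-order bits.
    acc = np.float32(0.0)
    for x in arr:
        acc = acc + x
    return float(acc)

pairwise = float(a.sum())  # NumPy uses pairwise summation internally
naive = naive_sum(a)
print(pairwise, naive)  # pairwise is much closer to 100000
```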

max9111
  • "for the parallel target which is a lot better in loop fusing" <- do you have a link or citation? That's the first time I heard about that and I would like to learn more. – MSeifert Oct 14 '19 at 17:58
  • "The problem is the mechanism how this replacement happens." could you elaborate? As far as I understand it the problem is not the mechanism, the problem is the function which creates the temporary array. That applies to NumPy and the numba implementation. In this regard NumPy is also a bit better than numba because NumPy uses the ref-count of the array to, sometimes, avoid temporary arrays. At least as far as I know. – MSeifert Oct 14 '19 at 18:00
  • @MSeifert I added links and timings regarding the automatic loop fusion. The documentation isn't that good on that topic; I learned 5 mins ago that this is even possible in single-threaded mode. Yes, what I wanted to say was: Numba tries to do exactly the same operation as Numpy (which also includes temporary arrays) and afterwards tries loop fusion and optimizing away unnecessary temporary arrays, with sometimes more, sometimes less success. – max9111 Oct 14 '19 at 19:13
  • I would have expected that 3 is the slowest, since it builds a further large temporary array, but it appears to be fastest - how come? – Make42 Oct 15 '19 at 13:40
  • @Make42 What do you mean with 3? Test_np_nb(a,b,c,d)? In the standard single-threaded version Test_np_nb(a,b,c,d), is about as slow as Test_np_nb_eq(a,b,c,d) – max9111 Oct 15 '19 at 13:47
  • @max9111: Yes, I meant that expected `Test_np_nb_eq` to be the slowest. But it seems that `Test_np` is the slowest. I am surprised by that. But maybe, numpy needs to create a pretty large array `a+b*2.+c*3.+d*4.` before the summation and numba is able to not do that. Is my reasoning here right? – Make42 Mar 23 '21 at 15:10
  • Yes. Numpy would even do operations like b*2. as a single calculation (using a temporary array). I was just too lazy to write every operation in `Test_np_nb_eq` out as a separate loop (which would be more similar to numpy). I also would use fastmath=True on summations (it makes SIMD vectorization possible). But the summation itself isn't really comparable to numpy, which uses a more advanced algorithm to get higher precision. – max9111 Mar 23 '21 at 15:27
  • @max9111: Do you - by any chance - have a tip where I can find more reading material on when numba is faster and when numpy is just as good? MSeifert mentioned that there are cases where one is better than the other and optimally one would know when to use which tool. I would like to get a better feeling when to use which tool. Also numexpr would be another contender. – Make42 Mar 23 '21 at 22:54
  • @Make42 I would say Numba is always beneficial if there is no good numpy solution, e.g. https://stackoverflow.com/a/58189944/4045774. Sometimes it can also be related to compiler restrictions (in some cases Clang/Numba is better than MSVC: https://stackoverflow.com/a/59003530/4045774), or if you go into details and really need performance: https://stackoverflow.com/a/64920528/4045774. The stuff which is in my experience always a bit slower is for example sorting. But to summarize, it is always a combination of coding effort and experience. Usually you would also write code quite C-like... – max9111 Mar 24 '21 at 09:40