
I can't figure out why numba is beating numpy here (over 3x). Did I make some fundamental error in how I am benchmarking here? Seems like the perfect situation for numpy, no? Note that as a check, I also ran a variation combining numba and numpy (not shown), which as expected was the same as running numpy without numba.

(btw this is a followup question to: Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba )

import numpy as np
from numba import jit
nobs = 10000 

def proc_numpy(x,y,z):

   x = x*2 - ( y * 55 )      # these 4 lines represent use cases
   y = x + y*2               # where the processing time is mostly
   z = x + y + 99            # a function of, say, 50 to 200 lines
   z = z * ( z - .88 )       # of fairly simple numerical operations

   return z

@jit
def proc_numba(xx,yy,zz):
   for j in range(nobs):     # as pointed out by Llopis, this for loop 
      x, y = xx[j], yy[j]    # is not needed here.  it is here by 
                             # accident because in the original benchmarks 
      x = x*2 - ( y * 55 )   # I was doing data creation inside the function 
      y = x + y*2            # instead of passing it in as an array
      z = x + y + 99         # in any case, this redundant code seems to 
      z = z * ( z - .88 )    # have something to do with the code running
                             # faster.  without the redundant code, the 
      zz[j] = z              # numba and numpy functions are exactly the same.
   return zz

x = np.random.randn(nobs)
y = np.random.randn(nobs)
z = np.zeros(nobs)
res_numpy = proc_numpy(x,y,z)

z = np.zeros(nobs)
res_numba = proc_numba(x,y,z)

results:

In [356]: np.all( res_numpy == res_numba )
Out[356]: True

In [357]: %timeit proc_numpy(x,y,z)
10000 loops, best of 3: 105 µs per loop

In [358]: %timeit proc_numba(x,y,z)
10000 loops, best of 3: 28.6 µs per loop

I ran this on a 2012 macbook air (13.3), standard anaconda distribution. I can provide more detail on my setup if it's relevant.

JohnE
  • I don't understand why in proc_numba you do the for loop and you don't in proc_numpy – llrs Sep 20 '14 at 17:13
  • @JohnE you should also bench with Numexpr (you will have to write it as a single string-like expression), but should be closer to numba perf - it avoids making temporaries – Jeff Sep 20 '14 at 17:20
  • @Llopis Actually that's just a residual of how I originally wrote the benchmark. But the question remains, how would (rather stupidly) writing it as I did with the extra steps actually end up resulting in an over 3x speedup? Unless I'm just really fundamentally missing something (very likely). – JohnE Sep 20 '14 at 17:24
  • @JohnE you can optimize the numpy code by doing things like: np.add(x,y, out=z) to avoid temporaries (it's not pretty to do this but should boost perf) – Jeff Sep 20 '14 at 17:25
  • @Jeff OK, I have not explicitly used numexpr before but I'll try to figure it out and add it later. That's good to know about np.add(), but from a practical perspective I'm not sure why I just wouldn't use numba here if it lets me write things more simply. – JohnE Sep 20 '14 at 17:26
  • @JohnE not saying u should use np.add with the out arg, just pointing out that's a reason why the difference exists – Jeff Sep 20 '14 at 17:48
  • I think you should try out Julia. – xiaodai Sep 11 '18 at 04:27

4 Answers

I think this question highlights (somewhat) the limitations of calling out to precompiled functions from a higher level language. Suppose in C++ you write something like:

for (int i = 0; i != N; ++i) a[i] = b[i] + c[i] + 2 * d[i];

The compiler sees all this at compile time, the whole expression. It can do a lot of really intelligent things here, including optimizing out temporaries (and loop unrolling).

In Python, however, consider what's happening: when you use numpy, each `+` uses operator overloading on the np array types (which are just thin wrappers around contiguous blocks of memory, i.e. arrays in the low-level sense), and calls out to a Fortran (or C++) function which does the addition super fast. But it just does one addition, and spits out a temporary array.

So in some ways, while numpy is awesome, convenient, and pretty fast, it is also slowing things down: it seems to be calling into a fast compiled language for the hard work, but the compiler never gets to see the whole program; it's just fed isolated little bits. And this is hugely detrimental to a compiler, especially modern compilers, which are very intelligent and can retire multiple instructions per cycle when the code is well written.

Numba, on the other hand, uses a JIT. So at runtime it can figure out that the temporaries are not needed and optimize them away. Basically, Numba gets a chance to compile the program as a whole; numpy can only call small atomic blocks which themselves have been pre-compiled.
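To make that concrete, here is a minimal sketch (pure Python/numpy; the function names are mine, not from the question) contrasting the two evaluation strategies: the vectorized form allocates a full temporary array per operator, while the scalar loop, which is the form numba compiles, makes a single pass with only scalar temporaries. Without @jit the loop is slow in interpreted Python, but both compute the same thing:

```python
import numpy as np

nobs = 10000
x = np.random.randn(nobs)
y = np.random.randn(nobs)

def proc_vectorized(x, y):
    # Each operator below allocates a full temporary array.
    x2 = x * 2 - y * 55
    y2 = x2 + y * 2
    z = x2 + y2 + 99
    return z * (z - .88)

def proc_fused(x, y):
    # One pass over the data with only scalar temporaries -- slow in
    # interpreted Python, but exactly the shape numba's JIT can compile.
    z = np.empty_like(x)
    for j in range(len(x)):
        xj = x[j] * 2 - y[j] * 55
        yj = xj + y[j] * 2
        zj = xj + yj + 99
        z[j] = zj * (zj - .88)
    return z

assert np.allclose(proc_vectorized(x, y), proc_fused(x, y))
```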

Nir Friedman
  • Yeah, that makes sense. In my original benchmark (linked to above) I just looped over `z=x+y` 100 times and saw a 50x improvement for numba over numpy. I assumed the jit had reduced the loop from 100x to 1x and I guess it's really just the same thing here, more or less. – JohnE Sep 20 '14 at 19:34
  • @xiaodai I mean Julia is jitted, so the Julia version would probably be at least as fast as numba, probably faster. I don't know that you can call one small problem like this a use case for Julia, though. More generally, Julia is designed largely to be as convenient as Python while having better performance in situations like this. – Nir Friedman Sep 11 '18 at 03:30
  • I think the point is that Python tends to delegate some tasks to faster languages, but Julia compiles everything together, so there is no distinction between the fast part and the slow part, and the user won't get confused like the OP here. – xiaodai Sep 11 '18 at 04:28
  • The numba jit-compiler isn't intelligently figuring out how to avoid temporaries or using any sort of whole-program optimization. The difference is that in the loop one explicitly _instructs_ the compiler to not make any temporaries, by coding everything as scalar operations. It's the same in Julia: if one writes it in 'ordinary' vectorized form, one gets temporaries and therefore numpy-like speed. Written as a loop or with dot-broadcasting, temporaries are explicitly avoided. If the compiler actually _is_ clever, one can get loop unrolling and simd on top of that. – DNF Sep 11 '18 at 05:12
  • @xiaodai Yeah, I agree with your point just the usage of the term "use-case" threw me a bit. But anyhow we both agree Julia makes getting better performance in this situation much easier and was obviously a very fundamental aspect of its motivation and design. – Nir Friedman Sep 11 '18 at 18:10
  • @DNF I'm not really sure what point you are trying to make here. Coding things as scalars doesn't tell the language not to create temporaries; temporaries exist as a concept in most languages and are obvious optimization opportunities, scalar or not, they are just easier to exploit as scalars. Do you have any evidence that in Julia, in vectorized form, it would not be optimized? That would be *amazingly* disappointing of Julia. Julia afaik uses LLVM for codegen, and llvm can do that optimization, e.g.: https://gcc.godbolt.org/z/lNkTnU – Nir Friedman Sep 11 '18 at 18:17
  • "Coding things as scalars doesn't tell the language not to create temporaries" Yes, sure it does. What do you mean? Naturally, temporary arrays are not created if you operate on scalars. And, obviously, temporary arrays are created when you write: `z = x * y + w` for a bunch of arrays. If you want to avoid that, you can use explicit loops, or dot broadcasting, like this: `z .= x .* y .+ w`. This blogpost explains it in some detail: https://julialang.org/blog/2017/01/moredots The point I was making was that it's not the compiler doing the observed optimization, but the programmer. – DNF Sep 11 '18 at 19:35
  • @DNF Working with scalars does not create temporary arrays, but temporary scalars, not too surprisingly. So when you just said "temporaries" without being specific, it was quite confusing. From the comments around the question, my understanding is that even without explicit for loops numba will perform the same. This optimization should be very doable automatically, from a number of different perspectives, so I find it very disappointing that Julia requires the programmer to add `.` everywhere to make it happen. – Nir Friedman Sep 11 '18 at 20:34
  • No, you get some speedup from numba, roughly 80% faster from a simple test. The explicit loops + `@jit` gives me 7x speedup. I don't know what you mean by 'temporary scalars', since I have never observed such a thing (are bitstype scalars ever heap allocated?). I'm sorry that you are disappointed, but the dots gives you guaranteed loop-fusion with no allocations, while llvm _might_ perhaps, on a lucky day, optimize it away. I really recommend the blog post, it doesn't just explain the performance optimization, but also how it will vectorize any function. It's one of my favorite parts of Julia. – DNF Sep 11 '18 at 20:51
  • The heap has nothing to do with temporaries. Temporaries in most languages are just anonymous results of sub-expressions that are immediately used in the broader expression. It's disappointing because this is well trodden ground. It always makes me sad to watch new technology unsolve solved problems. If Julia is widely used then no doubt people will on a regular basis be cancelling simulations taking too long after an hour or so, adding a bunch of `.`s, and re-running on a regular basis, cursing themselves for forgetting the `.`s all throughout. – Nir Friedman Sep 11 '18 at 21:55
  • Almost no languages can optimize away an intermediate array allocation in the general case, unless it can prove that each subexpression is pure, which is very difficult. Also, there are cases where intermediate arrays do give a speedup, for example if for some reason you are sorting one subresult. – saolof Sep 11 '18 at 22:32
  • @NirFriedman In everything I've written, allocations = heap allocations. What language can do what you are talking about? Certainly not llvm. As you can see, Numba fails at it, as expected. I _strongly_ recommend that you read the blog post if you haven't. – DNF Sep 12 '18 at 06:18
  • @DNF Err, I have no idea why you mention allocations = heap allocations? My comment about your lack of clarity was about the term *temporaries*, which is a thing for all types, not just heap allocation things like arrays. "Certainly not llvm": did you look at the godbolt link, I sent, at all? – Nir Friedman Sep 12 '18 at 14:43
  • @saolof See previous comment and this one. I read the post. It does not mention expression templates at all, which is a known technique that deterministically makes these kinds of optimization automatically. Expression templates are associated with statically typed languages, but the same concept can be easily used for dynamically typed languages, the only difference will be that the computation of the expression tree (which is cheap relative to operations on big vectors/arrays) will happen at runtime rather than compile time. – Nir Friedman Sep 12 '18 at 14:45
  • @NirFriedman This is far enough afield now. I just want to point out that by 'allocations' I mean 'allocation of temporaries' (heap allocations, because they are the reason for the slowness), and that, as you can check for yourself, the numba speedup is due to re-writing the computation as an explicit loop, not due to cleverness of the compiler. In other words, the last paragraph of your answer is wrong. – DNF Sep 12 '18 at 18:56
  • @DNF It's not wrong, because *I* don't write allocations, I write temporaries. And even if you write it element-wise, there are still scalar temporaries. And the only reason you *can* write it as an explicit python for loop is because numba gets to see the whole loop and make it more efficient (i.e. the loop checking is converted to a more efficient form than interpreted python which is crazy slow). Anyhow, I've given you way more than enough of my time, if you feel like my answer has shortcomings I'd encourage you to do what's recommended by the SO format and write your own. Thanks. – Nir Friedman Sep 12 '18 at 22:09
  • @NirFriedman - Your godbolt code is incorrect; you were missing a `return output;` in the `add` function. It even warned you about it! With that, then you can see that `add3` allocates two vectors even though it inlines `add`. Beyond that, I'd hope you can appreciate the huge difference in cost for an array temporary vs. a scalar temporary; the pedantic hair-splitting is not helpful. – mbauman Sep 13 '18 at 17:27
  • @MattB. Thanks for the catch. Indeed it's disappointing, but that's life. Expression templates (and similar) it is then. And I'm not hair splitting at all, I used the word temporaries in my answer referring to both scalar and array temporaries, that's entirely obvious by my first usage of it. To have someone talk past me for 10 comments by using their own definition of temporaries, different from both how I use it in my answer and how everyone defines temporaries, is not helpful. – Nir Friedman Sep 13 '18 at 21:25
  • Geez, tons of comments in the last day! You all know, old answers can be edited and new answers can be added at any time? ;-) – JohnE Sep 14 '18 at 08:32
  • Julia broadcast does the exact same thing as expression templates would in the special case of array expressions, and dot notation is sugar for it. You can apply vanilla expression templates in Julia just as easily as in C++, but anyone who writes generic code will generally expect a + b to be of type T if a and b are of type T, so overloading it to return something else that has to be unboxed to be used is generally bad practice. Broadcast lets you sandbox this to a domain specific language instead. If you want to do it with overloaded types, you're better off creating a monad imho. – saolof Sep 18 '18 at 16:49
  • @NirFriedman Thanks for the awesome answer. A minor comment: I don't think loop unrolling anyhow accelerates it, since evaluating the condition in the loop on each step (or using branch prediction) is way cheaper, than the operations inside the loop – Mikhail Genkin Sep 16 '20 at 14:45

When you ask numpy to do:

x = x*2 - ( y * 55 )

It is internally translated to something like:

tmp1 = y * 55
tmp2 = x * 2
tmp3 = tmp2 - tmp1
x = tmp3

Each of those temporaries is a full array that has to be allocated, operated on, and then deallocated. Numba, on the other hand, handles things one item at a time, so it doesn't have to deal with that overhead.
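For illustration, those temporaries can be eliminated by hand with numpy's `out=` argument, reusing preallocated buffers (the buffer names here are just for the sketch); the arithmetic is identical, only the allocations change:

```python
import numpy as np

x = np.random.randn(10000)
y = np.random.randn(10000)

# Naive form: allocates a fresh array for y*55, for x*2, and for the difference.
expected = x * 2 - y * 55

# Same computation written into preallocated buffers instead.
tmp = np.empty_like(y)
out = np.empty_like(x)
np.multiply(y, 55, out=tmp)     # tmp1 = y * 55
np.multiply(x, 2, out=out)      # tmp2 = x * 2
np.subtract(out, tmp, out=out)  # x*2 - y*55, no fresh allocation

assert np.array_equal(out, expected)
```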

Jaime
  • Hmmm... so basically my for loop had the unintended benefit of turning off numpy and thus avoided the temp arrays? – JohnE Sep 20 '14 at 18:50
  • Exactly... Thing is, were it not for the JIT compiler, the overhead of the Python loops and function calls are typically orders of magnitude slower than the extra array allocations. But if you were writing things directly in C, you would never do what numpy does internally! – Jaime Sep 21 '14 at 07:29
  • Thanks Jaime and everyone else here for the insights. Your answer and Nir's are fairly similar, I think Nir can use the rep points more than you so I'll give him the check. ;-) – JohnE Sep 21 '14 at 15:37

Numba is generally faster than Numpy and even Cython (at least on Linux).

Here's a plot (stolen from Numba vs. Cython: Take 2):

[Plot: Benchmark on Numpy, Cython and Numba]

In that benchmark, pairwise distances were computed, so the result may depend on the algorithm.

Note that this may be different on other platforms; see the following for WinPython (from the WinPython Cython tutorial):

[Plot: Benchmark on Numpy, Cython and Numba with WinPython]

sebix

Instead of cluttering the original question further, I'll add some more stuff here in response to Jeff, Jaime, Veedrac:

def proc_numpy2(x,y,z):
   np.subtract(np.multiply(x, 2), np.multiply(y, 55), out=x)
   np.add(x, np.multiply(y, 2), out=y)
   np.add(x, np.add(y, 99), out=z)
   np.multiply(z, np.subtract(z, .88), out=z)
   return z

def proc_numpy3(x,y,z):
   x *= 2
   x -= y*55
   y *= 2
   y += x
   z = x + y
   z += 99
   z *= (z-.88) 
   return z
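A quick sketch of the side effect mentioned below: the augmented assignments in proc_numpy3 mutate the caller's arrays in place, which is why fresh copies of x, y, z are needed between timing runs (the variable names here are just for the demonstration):

```python
import numpy as np

def proc_numpy3(x, y, z):
    x *= 2
    x -= y*55
    y *= 2
    y += x
    z = x + y
    z += 99
    z *= (z - .88)
    return z

x = np.random.randn(100)
y = np.random.randn(100)
x0, y0 = x.copy(), y.copy()

proc_numpy3(x, y, np.zeros(100))

# The in-place operators modified the caller's arrays:
assert not np.array_equal(x, x0)
assert not np.array_equal(y, y0)
```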

My machine seems to be running a tad faster today than yesterday, so here are the timings in comparison to proc_numpy (proc_numba times the same as before):

In [611]: %timeit proc_numpy(x,y,z)
10000 loops, best of 3: 103 µs per loop

In [612]: %timeit proc_numpy2(x,y,z)
10000 loops, best of 3: 92.5 µs per loop

In [613]: %timeit proc_numpy3(x,y,z)
10000 loops, best of 3: 85.1 µs per loop

Note that as I was writing proc_numpy2/3 I started seeing some side effects, so I made copies of x, y, z and passed the copies in instead of re-using x, y, z. Also, the different functions sometimes had slight differences in precision, so some of them didn't pass the equality tests, but if you diff them they are really close. I assume that is due to creating (or not creating) temp variables. E.g.:

In [458]: (res_numpy2 - res_numba)[:12]
Out[458]: 
array([ -7.27595761e-12,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -7.27595761e-12,   0.00000000e+00])
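Those last-bit residuals are expected: rewriting the expression changes the order of the floating-point operations, and float addition is not associative. A tiny self-contained example:

```python
# Floating-point addition is not associative: regrouping the same three
# terms changes the rounding in the last bit.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

assert a != b                 # 0.6000000000000001 vs 0.6
assert abs(a - b) < 1e-15     # but the difference is tiny
```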

Also, it's pretty minor (about 10 µs), but using float literals (55. instead of 55) saves a little time for numpy; it doesn't help numba.

JohnE
  • you have to use the out argument (the 3rd) to make this effective – Jeff Sep 20 '14 at 18:51
  • Instead of using the functions, `x = x*2 - ( y * 55 )` should be written `x *= 2; x -= y*55`, and similar with the other lines. That avoids most temporaries with much less visual noise. – Veedrac Sep 21 '14 at 05:09
  • @Veedrac OK, added that above. Not a huge difference but bigger than I was expecting. – JohnE Sep 21 '14 at 15:23