
I have an array of true/false values with a very simple structure:

import numpy as np

# the real array has hundreds of thousands of items
positions = np.array([True, False, False, False, True, True, True, True, False, False, False], dtype=bool)

I want to traverse this array and output the places where changes happen (true becomes false or vice versa). For this purpose, I've put together two different approaches (rough sketches below):

  • a recursive binary search (see if all values are the same, if not, split in two, then recurse)
  • a purely iterative search (loop through all elements and compare with the previous/next one)
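
In simplified form, the two ideas look roughly like this (these are only sketches with illustrative names, not the exact functions from the gist linked below):

def binary_search_sketch(arr, offset=0, out=None):
    # if the whole chunk is uniform there is no change inside it;
    # otherwise split it in two, check the boundary, and recurse on both halves
    if out is None:
        out = []
    if arr.size < 2 or np.all(arr) or not np.any(arr):
        return out
    mid = arr.size // 2
    binary_search_sketch(arr[:mid], offset, out)
    if arr[mid - 1] != arr[mid]:
        out.append(offset + mid)
    binary_search_sketch(arr[mid:], offset + mid, out)
    return out

def iterative_search_sketch(arr):
    # one pass: compare each element with the previous one
    out = []
    for i in range(1, arr.size):
        if arr[i] != arr[i - 1]:
            out.append(i)
    return out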

Both versions give exactly the result I want; however, Numba speeds up one far more than the other. With a dummy array of 300k values, here are the performance results:


  • pure Python binary-search runs in 11 ms
  • pure Python iterative-search runs in 1.1 s (100x slower than binary-search)
  • Numba binary-search runs in 5 ms (2 times faster than pure Python equivalent)
  • Numba iterative-search runs in 900 µs (1,200 times faster than pure Python equivalent)

As a result, when using Numba, binary_search is about 5x slower than iterative_search, while in theory it should be roughly 100x faster (around 9 µs, if it were accelerated as effectively).

What can be done to make Numba accelerate binary-search as much as it accelerates iterative-search?

Code for both approaches (along with a sample position array) is available on this public gist: https://gist.github.com/JivanRoquet/d58989aa0a4598e060ec2c705b9f3d8f

Note: Numba is not running binary_search() in object mode; with nopython=True specified, it doesn't complain and happily compiles the function.

Jivan

3 Answers


You can find the positions of value changes by using np.diff; there is no need for a more complicated algorithm or for Numba:

positions = np.array([True, False, False, False, True, True, True, True, False, False, False], dtype=bool)
dpos = np.diff(positions)
# array([ True, False, False,  True, False, False, False,  True, False, False])

This works because False - True == -1 and bool(-1) == True.
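
If the goal is the indices where the changes occur, they can be read straight off dpos with np.flatnonzero (adding 1 gives the index of the element after each change):

change_positions = np.flatnonzero(dpos) + 1
# array([1, 4, 8])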

It performs quite well on my several-year-old laptop, even running on battery (i.e. throttled by the energy-saving mode):

In [52]: positions = np.random.randint(0, 2, size=300_000, dtype=bool)          

In [53]: %timeit np.diff(positions)                                             
633 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I'd imagine that writing your own diff in numba should yield similar performance.

EDIT: That last statement is false. I implemented a simple diff function using Numba, and it's more than a factor of 10 faster than the NumPy one (it obviously has far fewer features, but it should be sufficient for this task):

import numpy as np
import numba

@numba.njit
def ndiff(x):
    # like np.diff for 1-D input: difference of consecutive elements
    s = x.size - 1
    r = np.empty(s, dtype=x.dtype)
    for i in range(s):
        r[i] = x[i + 1] - x[i]
    return r

In [68]: np.all(ndiff(positions) == np.diff(positions))                            
Out[68]: True

In [69]: %timeit ndiff(positions)                                               
46 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Jan Christoph Terasa
  • On the premises that `np.diff()` might just do the trick, you [can accelerate this quite a bit](https://stackoverflow.com/a/62373366/5218354) if you do not need the extra functionalities provided by the NumPy function. – norok2 Jun 14 '20 at 14:18
  • @norok2 I just realized that myself and added the code. Thanks for the tip, though. :) – Jan Christoph Terasa Jun 14 '20 at 14:19
  • @JanChristophTerasa — saying that you're a life-saviour would be the understatement of the year – Jivan Jun 14 '20 at 14:23
  • JanChristophTerasa, it looks like you thought of using a Numba version of `np.diff()` at the same time as @norok2, but I'll give you the accepted answer because you came up first with the idea to use `np.diff()` in the first place, perfectly nailing the question-behind-my-question – Jivan Jun 14 '20 at 14:43

The main issue is that you are not performing an apples-to-apples comparison. What you provide is not an iterative and a recursive version of the same algorithm; you are proposing two fundamentally different algorithms, which happen to be recursive and iterative, respectively.

In particular, you use NumPy built-ins much more heavily in the recursive approach, so it is no wonder that there is such a staggering difference between the two. It should also come as no surprise that the Numba JIT is more effective when you avoid NumPy built-ins. Ultimately, the recursive algorithm is also simply less efficient: there is hidden nested looping in the np.all() and np.any() calls that the iterative approach avoids, so even if you wrote all of your code in pure Python to be accelerated more effectively by Numba, the recursive approach would still be slower.

In general, iterative approaches are faster than their recursive equivalents, because they avoid the function-call overhead (which is minimal for JIT-compiled functions compared to pure Python ones). So I would advise against rewriting the algorithm in recursive form, only to discover that it is slower.


EDIT

On the premises that a simple np.diff() would do the trick, Numba can still be quite beneficial:

import numpy as np
import numba as nb


@nb.jit
def diff(arr):
    n = arr.size
    result = np.empty(n - 1, dtype=arr.dtype)
    for i in range(n - 1):
        result[i] = arr[i + 1] ^ arr[i]
    return result


positions = np.random.randint(0, 2, size=300_000, dtype=bool)
print(np.allclose(np.diff(positions), diff(positions)))
# True


%timeit np.diff(positions)
# 1000 loops, best of 3: 603 µs per loop
%timeit diff(positions)
# 10000 loops, best of 3: 43.3 µs per loop

with the Numba approach being some 13x faster (in this test, mileage may vary, of course).

norok2
  • Yep — I think the trouble comes from the erroneous notion that my `binary_search()` is a true binary search. It can't be, as the array is not sorted by values. It's only seemingly binary from a Python perspective, but in C the code actually makes many, many repeated passes over the same chunks of the array. It is a bit as if one attempted a binary search over an unsorted array (which makes no sense). – Jivan Jun 14 '20 at 13:49
  • I was under the assumption that `np.all()` and `np.any()` on boolean arrays were, even in NumPy's C implementation, O(1) operations, while they seem to be O(n). The belief was based on the idea that checking whether all bit values (that's what booleans are) are the same over a contiguous chunk of memory would be O(1) because it's a simple `bitwise_or`, but it looks like I was mistaken. Do you think it would in theory be possible and it's only NumPy's implementation that is not supporting it, or is it fundamentally impossible? – Jivan Jun 14 '20 at 13:52
  • It can be O(1) the same way that hash look-up can be O(1). It can, but you would need a data structure supporting that specifically to get O(1). For a C array of bytes or a NumPy array, that would be not possible. – norok2 Jun 14 '20 at 13:55
  • In other words, if we were to do it in C code, is it possible to assess that the bitwise_or value of a "big collection of bits between this and that memory address" is either 0 or 1 in one O(1) operation? – Jivan Jun 14 '20 at 13:55
  • Even `bitwise_or` is O(n). In C that is O(1) on the premises that you are working with fixed-length words. If you were to do it on an array, that would be O(n). – norok2 Jun 14 '20 at 13:56
  • alright, perfectly clear. Thanks for confirming this. – Jivan Jun 14 '20 at 13:57

The gist is that only the part of the logic that uses Python machinery can be accelerated, by replacing it with equivalent C logic that strips away most of the complexity (and flexibility) of the Python runtime (I presume this is what Numba does).

All the heavy lifting in NumPy operations is already implemented in C and very simple (since NumPy arrays are contiguous chunks of memory holding regular C types) so Numba can only strip the parts that interface with Python machinery.

Your "binary search" algorithm does much more work and makes much heavier use of NumPy's vector operations while at it, so less of it can be accelerated this way.

ivan_pozdeev
  • I think I can see why. If we look under the hood, the binary search is probably not really one, because of all these calls to `arr.all()` and `arr.any()`, which must check every single value (as the array is not sorted by value). Still, one would think that calling `arr.any()` on a NumPy array would be pretty fast, as it's a bitwise operation that NumPy could in theory do in one pass over a contiguous chunk of memory. – Jivan Jun 14 '20 at 13:25
  • @Jivan the problem is, you iterate over the array many, many times with `.any` and `.all`, while the "iterative search" only goes through it once. Pure Python operations are slow, so it's initially slower, but once compiled, a single-iteration method emerges as the clear winner. Cf. https://stackoverflow.com/questions/26816590/faster-alternatives-to-numpy-argmax-argmin-which-is-slow/26820109#26820109 – ivan_pozdeev Jun 14 '20 at 13:39