Why does "numpy.any" have no short-circuit mechanism?

Question

I don't understand why such a basic optimization has not been done yet:

In [1]: one_million_ones = np.ones(10**6)
In [2]: %timeit one_million_ones.any()
100 loops, best of 3: 693µs per loop

In [3]: ten_millions_ones = np.ones(10**7)
In [4]: %timeit ten_millions_ones.any()
10 loops, best of 3: 7.03 ms per loop

The whole array is scanned, even though the result is evident from the first item.

Even though this question is not exactly a duplicate of [this question](https://stackoverflow.com/q/43973991/2454357), it still deals with the same topic. See also the answer, which has a very nice workaround. — Thomas Kühn, Aug 19 '17 at 12:50
Possible duplicate of [Check if two numpy arrays are identical](https://stackoverflow.com/questions/43973991/check-if-two-numpy-arrays-are-identical) — Thomas Kühn, Aug 19 '17 at 12:50
More frequently this is called `short-circuiting`. `lazy` means don't evaluate until needed. — hpaulj, Aug 19 '17 at 14:55
https://stackoverflow.com/q/41320568 - argmax short circuits on `np.nan`. — hpaulj, Aug 19 '17 at 15:12

user2357112 · Accepted Answer · 2017-12-17T23:06:47.737

It's an unfixed performance regression. NumPy issue 3446. There actually is short-circuiting logic, but a change to the ufunc.reduce machinery introduced an unnecessary chunk-based outer loop around the short-circuiting logic, and that outer loop doesn't know how to short circuit. You can see some explanation of the chunking machinery here.

The short-circuiting effects wouldn't have shown up in your test even without the regression, though. First, you're timing the array creation, and second, I don't think they ever put in the short-circuit logic for any input dtype but boolean. From the discussion, it sounds like the details of the ufunc reduction machinery behind numpy.any would have made that difficult.

The discussion does bring up the surprising point that the argmin and argmax methods appear to short-circuit for boolean input. A quick test shows that as of NumPy 1.12 (not quite the most recent version, but the version currently on Ideone), x[x.argmax()] short-circuits, and it outcompetes x.any() and x.max() for 1-dimensional boolean input no matter whether the input is small or large and no matter whether the short-circuiting pays off. Weird!
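As a minimal sketch of the `x[x.argmax()]` trick described above: for a 1-dimensional boolean array, `argmax` returns the index of the first `True` (or 0 if there is none), so indexing with it gives the same answer as `any`. The helper name `any_via_argmax` is hypothetical, and whether this is actually faster depends on the NumPy version, so it is worth timing on your own setup.

```python
import numpy as np

def any_via_argmax(x):
    """Return True if a 1-D boolean array contains at least one True.

    Hypothetical helper illustrating the x[x.argmax()] workaround.
    """
    x = np.asarray(x)
    if x.dtype != np.bool_ or x.ndim != 1:
        raise TypeError("expected a 1-D boolean array")
    if x.size == 0:
        # argmax raises ValueError on empty input, while any() gives False.
        return False
    # For boolean input, argmax is the index of the first True,
    # or 0 when every element is False.
    return bool(x[x.argmax()])

flags = np.zeros(1000, dtype=bool)
flags[3] = True
print(any_via_argmax(flags))
```

The explicit checks keep the helper from silently giving a misleading answer for non-boolean or multi-dimensional input, where `argmax` has a different meaning.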

MSeifert · Answer 2 · 2017-08-19T16:30:30.220

There's a price you pay for short-circuiting. You need to introduce branches in your code.

The problem with branches (e.g. if statements) is that they can be slower than branchless alternatives, and branch mispredictions can add significant overhead.

Also, depending on the compiler and processor, the branchless code could use processor vectorization. I'm not an expert in this, but maybe some sort of SIMD (e.g. SSE)?

I'll use numba here because the code is easy to read and it's fast enough that these small differences show up in the timings:

import numba as nb
import numpy as np

@nb.njit
def any_sc(arr):
    for item in arr:
        if item:
            return True
    return False

@nb.njit
def any_not_sc(arr):
    res = False
    for item in arr:
        res |= item
    return res

arr = np.zeros(100000, dtype=bool)
assert any_sc(arr) == any_not_sc(arr)
%timeit any_sc(arr)
# 126 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit any_not_sc(arr)
# 15.5 µs ± 962 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.1 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Without branches it's roughly 8 times faster in the worst case (15.5 µs vs. 126 µs). But in the best case the short-circuit function is much faster:

arr = np.zeros(100000, dtype=bool)
arr[0] = True
%timeit any_sc(arr)
# 1.97 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit any_not_sc(arr)
# 15.1 µs ± 368 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.2 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So the question is which case should be optimized: the best case? The worst case? The average case (and what is the average case for any)?

It could be that the NumPy developers wanted to optimize the worst case and not the best case. Or they just didn't care? Or maybe they just wanted "predictable" performance in any case.
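One possible middle ground between these cases, offered as a sketch and not as how NumPy implements `any`: scan the array in fixed-size chunks, let the branchless `any` handle each chunk, and branch only between chunks. The function name `any_chunked` and the default `chunk_size` of 4096 are assumptions chosen for illustration; a good chunk size would have to be found by measurement.

```python
import numpy as np

def any_chunked(arr, chunk_size=4096):
    """Short-circuit between chunks, branchless within each chunk.

    Hypothetical helper; chunk_size is an untuned illustrative default.
    """
    # Flatten to 1-D. This is a view for contiguous input and a copy otherwise.
    flat = np.asarray(arr).reshape(-1)
    for start in range(0, flat.size, chunk_size):
        # Slicing creates a view, so no data is copied per chunk.
        if flat[start:start + chunk_size].any():
            return True  # stop at the first chunk containing a truthy item
    return False

data = np.zeros(100000, dtype=bool)
data[0] = True
print(any_chunked(data))
```

In the best case only the first chunk is scanned; in the worst case the cost is the full scan plus one Python-level loop iteration per chunk.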

Just a note on your code: you measure the time it takes to create the array as well as the time it takes to execute any. If any short-circuited, you wouldn't have noticed it with your code!

%timeit np.ones(10**6)
# 9.12 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.ones(10**7)
# 86.2 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

For conclusive timings supporting your question you should have used this instead:

arr1 = np.ones(10**6)
arr2 = np.ones(10**7)
%timeit arr1.any()
# 4.04 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr2.any()
# 39.8 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
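Outside IPython, the same separation of array creation from the `any` call can be sketched with the standard `timeit` module. The helper `best_time_per_call` and the array size are assumptions for illustration, and the absolute numbers will differ by machine and NumPy version.

```python
import timeit
import numpy as np

def best_time_per_call(func, number=10, repeat=3):
    """Best average time per call in seconds (hypothetical helper)."""
    return min(timeit.repeat(func, number=number, repeat=repeat)) / number

n = 10**5

# Create the array once, outside the timed code.
arr = np.ones(n)
t_any = best_time_per_call(arr.any)

# For comparison: array creation and any() timed together.
t_both = best_time_per_call(lambda: np.ones(n).any())

print(f"any() only:       {t_any * 1e6:.1f} µs per call")
print(f"creation + any(): {t_both * 1e6:.1f} µs per call")
```

Passing the bound method `arr.any` directly avoids including a lambda call in the first measurement.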

I suspect your timings are being thrown off by some inefficiency in the Numba generated code. Short circuiting should not have that kind of disastrous effect when it doesn't pay off. The extra branch is very easily predicted. — user2357112, Aug 19 '17 at 18:23
@user2357112 yeah, it seems too much, but a branch always has overhead because even if the predictions are always correct it still needs to "check" at some point. It's also possible that numba realizes that the branchless one can use processor vectorizations and doesn't even attempt them in the first case. I haven't had time to investigate the particularities in my examples. I suspect that with some dedicated effort and coding this directly in C - the time difference in the worst case will be lower (maybe just a factor of 2 or less) but the branching code will be slower in the worst case. — MSeifert, Aug 19 '17 at 18:33
Yeah, but the thing is, the loop doesn't actually have to *wait* for the check. The check can happen in parallel with the loop continuing its work. I think correctly-predicted branches may be literally zero-latency these days. — user2357112, Aug 19 '17 at 18:41
@user2357112 I'm really not sure. I just tried it with cython: there both functions were roughly equally fast in the worst case but what struck me as odd was that both were almost as fast as the short-circuiting numba function. I suspect that numba may not be "inefficient" for the short-circuit case but maybe extremely efficient at the not-short-circuiting function. However, I don't really have time right now to really inspect the asm of numba or the code of cython. Maybe after the weekend. — MSeifert, Aug 19 '17 at 19:01
