There's a price you pay for short-circuiting. You need to introduce branches in your code.
The problem with branches (e.g. if
statements) is that they can be slower than using alternative operations (without branches) and then you also have branch prediction which could include a significant overhead.
Also depending on the compiler and processor the branchless code could use processor vectorization. I'm not an expert in this but maybe some sort of SIMD or SSE?
I'll use numba here because the code is easy to read and it's fast enough so the performance will change based on these small differences:
import numba as nb
import numpy as np
@nb.njit
def any_sc(arr):
for item in arr:
if item:
return True
return False
@nb.njit
def any_not_sc(arr):
res = False
for item in arr:
res |= item
return res
arr = np.zeros(100000, dtype=bool)
assert any_sc(arr) == any_not_sc(arr)
%timeit any_sc(arr)
# 126 µs ± 7.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit any_not_sc(arr)
# 15.5 µs ± 962 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.1 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
It's almost 10 times faster in the worst case without branches. But in the best case the short-circuit function is much faster:
arr = np.zeros(100000, dtype=bool)
arr[0] = True
%timeit any_sc(arr)
# 1.97 µs ± 12.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit any_not_sc(arr)
# 15.1 µs ± 368 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit arr.any()
# 31.2 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So it's a question what case should be optimized: The best case? The worst case? The average case (what's the average case with any
)?
It could be that the NumPy developers wanted to optimize the worst case and not the best case. Or they just didn't care? Or maybe they just wanted "predictable" performance in any case.
Just a note on your code: You measure the time it takes to create an array as well as the time it takes to execute any
. If any
were short-circuit you wouldn't have noticed it with your code!
%timeit np.ones(10**6)
# 9.12 ms ± 635 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit np.ones(10**7)
# 86.2 ms ± 5.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
For conclusive timings supporting your question you should have used this instead:
arr1 = np.ones(10**6)
arr2 = np.ones(10**7)
%timeit arr1.any()
# 4.04 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit arr2.any()
# 39.8 ms ± 1.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)