15

consider the array a

a = np.array([3, 3, np.nan, 3, 3, np.nan])

I could do

np.isnan(a).argmax()

But this requires finding all np.nan just to find the first.
Is there a more efficient way?


I've been trying to figure out if I can pass a parameter to np.argpartition such that np.nan get's sorted first as opposed to last.


EDIT regarding [dup].
There are several reasons this question is different.

  1. That question and answers addressed equality of values. This is in regards to isnan.
  2. Those answers all suffer from the same issue my answer faces. Note, I provided a perfectly valid answer but highlighted it's inefficiency. I'm looking to fix the inefficiency.

EDIT regarding second [dup].

Still addressing equality and question/answers are old and very possibly outdated.

fuglede
  • 17,388
  • 2
  • 54
  • 99
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • 2
    Possible duplicate of [Is there a Numpy function to return the first index of something in an array?](http://stackoverflow.com/questions/432112/is-there-a-numpy-function-to-return-the-first-index-of-something-in-an-array) – Delgan Dec 25 '16 at 11:01
  • 1
    Possible duplicate of [Numpy: find first index of value fast](http://stackoverflow.com/questions/7632963/numpy-find-first-index-of-value-fast) – fuglede Dec 25 '16 at 11:12
  • 1
    The 2nd dup addresses `short circuiting` alternatives. The `isnan` part this problem does not make it unique. But the dup is dated. – hpaulj Dec 25 '16 at 11:39
  • Cython/ctypes/JIT would probably be the way to go to `short circuiting` large arrays, as the second duplicate shows. – Imanol Luengo Dec 25 '16 at 11:55
  • 2
    To make this fair, you need to devise a timing framework. The short-circuiting times depend heavily on where that first `nan` occurs, while the whole array code is stable, depending only on the overall length of the array (and even then I have increase lengths above 1000 to see changes). – hpaulj Dec 25 '16 at 17:55

4 Answers4

12

It might also be worth to look into numba.jit; without it, the vectorized version will likely beat a straight-forward pure-Python search in most scenarios, but after compiling the code, the ordinary search will take the lead, at least in my testing:

In [63]: a = np.array([np.nan if i % 10000 == 9999 else 3 for i in range(100000)])

In [70]: %paste
import numba

def naive(a):
        for i in range(len(a)):
                if np.isnan(a[i]):
                        return i

def short(a):
        return np.isnan(a).argmax()

@numba.jit
def naive_jit(a):
        for i in range(len(a)):
                if np.isnan(a[i]):
                        return i

@numba.jit
def short_jit(a):
        return np.isnan(a).argmax()
## -- End pasted text --

In [71]: %timeit naive(a)
100 loops, best of 3: 7.22 ms per loop

In [72]: %timeit short(a)
The slowest run took 4.59 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 37.7 µs per loop

In [73]: %timeit naive_jit(a)
The slowest run took 6821.16 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.79 µs per loop

In [74]: %timeit short_jit(a)
The slowest run took 395.51 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 144 µs per loop

Edit: As pointed out by @hpaulj in their answer, numpy actually ships with an optimized short-circuited search whose performance is comparable with the JITted search above:

In [26]: %paste
def plain(a):
        return a.argmax()

@numba.jit
def plain_jit(a):
        return a.argmax()
## -- End pasted text --

In [35]: %timeit naive(a)
100 loops, best of 3: 7.13 ms per loop

In [36]: %timeit plain(a)
The slowest run took 4.37 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.04 µs per loop

In [37]: %timeit naive_jit(a)
100000 loops, best of 3: 6.91 µs per loop

In [38]: %timeit plain_jit(a)
10000 loops, best of 3: 125 µs per loop
fuglede
  • 17,388
  • 2
  • 54
  • 99
10

I'll nominate

a.argmax()

With @fuglede's test array:

In [1]: a = np.array([np.nan if i % 10000 == 9999 else 3 for i in range(100000)])
In [2]: np.isnan(a).argmax()
Out[2]: 9999
In [3]: np.argmax(a)
Out[3]: 9999
In [4]: a.argmax()
Out[4]: 9999

In [5]: timeit a.argmax()
The slowest run took 29.94 ....
10000 loops, best of 3: 20.3 µs per loop

In [6]: timeit np.isnan(a).argmax()
The slowest run took 7.82 ...
1000 loops, best of 3: 462 µs per loop

I don't have numba installed, so can compare that. But my speedup relative to short is greater than @fuglede's 6x.

I'm testing in Py3, which accepts <np.nan, while Py2 raises a runtime warning. But the code search suggests this isn't dependent on that comparison.

/numpy/core/src/multiarray/calculation.c PyArray_ArgMax plays with axes (moving the one of interest to the end), and delegates the action to arg_func = PyArray_DESCR(ap)->f->argmax, a function that depends on the dtype.

In numpy/core/src/multiarray/arraytypes.c.src it looks like BOOL_argmax short circuits, returning as soon as it encounters a True.

for (; i < n; i++) {
    if (ip[i]) {
        *max_ind = i;
        return 0;
    }
}

And @fname@_argmax also short circuits on maximal nan. np.nan is 'maximal' in argmin as well.

#if @isfloat@
    if (@isnan@(mp)) {
        /* nan encountered; it's maximal */
        return 0;
    }
#endif

Comments from experienced c coders are welcomed, but it appears to me that at least for np.nan, a plain argmax will be as fast you we can get.

Playing with the 9999 in generating a shows that the a.argmax time depends on that value, consistent with short circuiting.

hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • Now that I can think straight. This is wonderful! And it all comes down to `np.nan` being maximal. – piRSquared Dec 26 '16 at 04:48
  • Interesting! On my test setup, the plain `argmax` performs as well as the JITted search. Which I guess makes sense since apparently they're doing the same thing! I'll add the timings to the other answer. – fuglede Dec 26 '16 at 12:24
6

Here is a pythonic approach using itertools.takewhile():

from itertools import takewhile
sum(1 for _ in takewhile(np.isfinite, a))

Benchmark with generator_expression_within_next approach: 1

In [118]: a = np.repeat(a, 10000)

In [120]: %timeit next(i for i, j in enumerate(a) if np.isnan(j))
100 loops, best of 3: 12.4 ms per loop

In [121]: %timeit sum(1 for _ in takewhile(np.isfinite, a))
100 loops, best of 3: 11.5 ms per loop

But still (by far) slower than numpy approach:

In [119]: %timeit np.isnan(a).argmax()
100000 loops, best of 3: 16.8 µs per loop

1. The problem with this approach is using enumerate function. Which returns an enumerate object from the numpy array first (which is an iterator like object) and calling the generator function and next attribute of the iterator will take time.

Mazdak
  • 105,000
  • 18
  • 159
  • 188
3

When looking for the first match in various scenarios, we could iterate through and look for the first match and exit out on the first match rather than going/processing the entire array. So, we would have an approach using Python's next function , like so -

next((i for i, val in enumerate(a) if np.isnan(val)))

Sample runs -

In [192]: a = np.array([3, 3, np.nan, 3, 3, np.nan])

In [193]: next((i for i, val in enumerate(a) if np.isnan(val)))
Out[193]: 2

In [194]: a[2] = 10

In [195]: next((i for i, val in enumerate(a) if np.isnan(val)))
Out[195]: 5
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • I was going to suggest a loop with break, but this use of generator plus one `next` is equivalent. I wonder about the speed. – hpaulj Dec 25 '16 at 11:23
  • 3
    For a very large array and nan near the start, an iterate till first has a chance of being faster. But in general the compiled numpy functions will be faster even though they traverse the whole array. What's the required test suit. – hpaulj Dec 25 '16 at 11:29
  • I even tried to use a `deque` from `collections`. It didn't help. – piRSquared Dec 25 '16 at 11:30
  • 2
    http://stackoverflow.com/a/40426159/901925 in the first `duplicate` presents this generator next solution. – hpaulj Dec 25 '16 at 11:32