
I need to find the index of the first value in a 1d NumPy array, or Pandas numeric series, satisfying a condition. The array is large and the index may be near the start or end of the array, or the condition may not be met at all. I can't tell in advance which is more likely. If the condition is not met, the return value should be -1. I've considered a few approaches.

Attempt 1

# func(arr) returns a Boolean array
idx = next(iter(np.where(func(arr))[0]), -1)

But this is often too slow as func(arr) applies a vectorised function on the entire array rather than stopping when the condition is met. Specifically, it is expensive when the condition is met near the start of the array.

Attempt 2

np.argmax is marginally faster, but fails to identify when a condition is never met:

np.random.seed(0)
arr = np.random.rand(10**7)

assert next(iter(np.where(arr > 0.999999)[0]), -1) == np.argmax(arr > 0.999999)

%timeit next(iter(np.where(arr > 0.999999)[0]), -1)  # 21.2 ms
%timeit np.argmax(arr > 0.999999)                    # 17.7 ms

np.argmax(arr > 1.0) returns 0 because the comparison yields an all-False array and np.argmax returns the first index of the maximum (False). This is indistinguishable from a genuine match at index 0.
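One common workaround (a sketch, not one of the original attempts; first_index_argmax is just an illustrative name) is to keep np.argmax but verify the hit before trusting a returned 0:

```python
import numpy as np

def first_index_argmax(mask):
    # np.argmax returns 0 for an all-False mask,
    # so check that the returned position is a genuine match
    idx = int(np.argmax(mask))
    return idx if mask[idx] else -1

arr = np.array([0.5, 0.7, 0.95])
first_index_argmax(arr > 0.9)   # 2
first_index_argmax(arr > 1.0)   # -1
```

Note this fixes only correctness: the full array is still scanned, so the early-exit cost problem remains.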

Attempt 3

# func(val) returns a Boolean scalar
idx = next((idx for idx, val in enumerate(arr) if func(val)), -1)

But this is too slow when the condition is met near the end of the array. Presumably this is because the generator expression has an expensive overhead from a large number of __next__ calls.

Is this always a compromise or is there a way, for generic func, to extract the first index efficiently?

Benchmarking

For benchmarking, assume the condition func checks is whether a value is greater than a given constant:

# Python 3.6.5, NumPy 1.14.3, Numba 0.38.0
import numpy as np

np.random.seed(0)
arr = np.random.rand(10**7)
m = 0.9
n = 0.999999

# Start of array benchmark
%timeit next(iter(np.where(arr > m)[0]), -1)                       # 43.5 ms
%timeit next((idx for idx, val in enumerate(arr) if val > m), -1)  # 2.5 µs

# End of array benchmark
%timeit next(iter(np.where(arr > n)[0]), -1)                       # 21.4 ms
%timeit next((idx for idx, val in enumerate(arr) if val > n), -1)  # 39.2 ms
jpp

2 Answers


numba

With numba it's possible to optimise both scenarios. Syntactically, you need only construct a function with a simple for loop:

from numba import njit

@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

idx = get_first_index_nb(A, 0.9)

Numba improves performance by JIT ("Just In Time") compiling code and leveraging CPU-level optimisations. A regular for loop without the @njit decorator would typically be slower than the methods you've already tried for the case where the condition is met late.

For a Pandas numeric series df['data'], you can simply feed the NumPy representation to the JIT-compiled function:

idx = get_first_index_nb(df['data'].values, 0.9)

Generalisation

Since numba permits functions as arguments, and assuming the passed function can also be JIT-compiled, you can arrive at a method to calculate the nth index where a condition is met for an arbitrary func.

@njit
def get_nth_index_count(A, func, count):
    c = 0
    for i in range(len(A)):
        if func(A[i]):
            c += 1
            if c == count:
                return i
    return -1

@njit
def func(val):
    return val > 0.9

# get index of 3rd value where func evaluates to True
idx = get_nth_index_count(arr, func, 3)

For the 3rd-last value, you can feed the reversed array, arr[::-1], and subtract the resulting index from len(arr) - 1, the - 1 being necessary to account for 0-indexing.
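As a sketch of that index arithmetic (using a plain-Python stand-in for the @njit function, so it is easy to check by hand):

```python
def get_nth_index_count(A, func, count):
    # plain-Python equivalent of the @njit version above
    c = 0
    for i in range(len(A)):
        if func(A[i]):
            c += 1
            if c == count:
                return i
    return -1

arr = [0.95, 0.1, 0.92, 0.99, 0.2, 0.91]
func = lambda val: val > 0.9

# 3rd-last match: search the reversed array, then map back
idx_rev = get_nth_index_count(arr[::-1], func, 3)
idx = len(arr) - 1 - idx_rev if idx_rev != -1 else -1
# idx == 2: arr[2] is the 3rd-last value greater than 0.9
```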

Performance benchmarking

# Python 3.6.5, NumPy 1.14.3, Numba 0.38.0

np.random.seed(0)
arr = np.random.rand(10**7)
m = 0.9
n = 0.999999

@njit
def get_first_index_nb(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

def get_first_index_np(A, k):
    for i in range(len(A)):
        if A[i] > k:
            return i
    return -1

%timeit get_first_index_nb(arr, m)                                 # 375 ns
%timeit get_first_index_np(arr, m)                                 # 2.71 µs
%timeit next(iter(np.where(arr > m)[0]), -1)                       # 43.5 ms
%timeit next((idx for idx, val in enumerate(arr) if val > m), -1)  # 2.5 µs

%timeit get_first_index_nb(arr, n)                                 # 204 µs
%timeit get_first_index_np(arr, n)                                 # 44.8 ms
%timeit next(iter(np.where(arr > n)[0]), -1)                       # 21.4 ms
%timeit next((idx for idx, val in enumerate(arr) if val > n), -1)  # 39.2 ms
jpp

I also wanted to do something similar and found that the solutions presented in this question did not really help me. In particular, the numba solution was much slower for me than the more conventional methods presented in the question itself.

I have a times_all list, typically of the order of tens of thousands of elements, and want to find the index of the first element of times_all which is bigger than a given time_event. And I have thousands of time_events. My solution is to divide times_all into chunks of, for example, 100 elements, first decide which segment time_event belongs to, keep the index of the first element of that segment, then find the index within that segment, and add the two indices. Here is a minimal version. For me, it runs orders of magnitude faster than the other solutions on this page.

import numpy as np

def event_time_2_index(time_event, times_all, STEPS=100):
    # indices of the first element of each segment of length STEPS
    time_indices_jumps = np.arange(0, len(times_all), STEPS)
    time_list_jumps = [times_all[idx] for idx in time_indices_jumps]

    # which segment does time_event fall in?
    time_list_jumps_idx = next((idx for idx, val in enumerate(time_list_jumps)
                                if val > time_event), -1)
    # index of the first element of the preceding segment
    index_in_jumps = time_indices_jumps[time_list_jumps_idx - 1]
    times_cropped = times_all[index_in_jumps:]

    # scan only from that segment onwards
    event_index_rel = next((idx for idx, val in enumerate(times_cropped)
                            if val > time_event), -1)

    event_index = event_index_rel + index_in_jumps
    return event_index
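For what it's worth, since this approach already assumes times_all is sorted, NumPy's np.searchsorted (a binary search) can return the same "first index greater than" directly; a minimal sketch with synthetic data:

```python
import numpy as np

times_all = np.array([0.0, 0.5, 1.0, 1.5, 2.0])  # sorted timestamps
time_event = 0.7

# side='right' gives the first index i with times_all[i] > time_event
idx = int(np.searchsorted(times_all, time_event, side='right'))
# idx == 2
```

Unlike the function above, it returns len(times_all) rather than -1 when no element qualifies, so that case needs converting if -1 is required.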
Vahid S. Bokharaie
  • Can you provide some sample inputs to demonstrate how this is faster? I am surprised (except in the specific case where a condition is met very early) that a generator expression would be efficient. Your logic with `next` + generator expression is essentially my attempt #3. – jpp Nov 05 '19 at 21:20
  • The data I work on are experimental data and at this stage I cannot share them. But I have a sorted array of time steps, with steps of 1/320 seconds, and approx. 1e5 samples, and another array of event times, which is typically of the order of thousands. And I need the index of these events, to use in the EEG analysis tool. Using this segmentation trick, for 1e5 samples, the maximum number of comparisons is 1000+100, but without this segmentation, it can be anything up to 1e5-1. I used the `next` generator because in the benchmarks you have done it was the fastest, and also it is just one line. – Vahid S. Bokharaie Nov 06 '19 at 09:05
  • And also, for me, `numba` function was slower than the other solutions, which was not what I expected. Although I should say I run my code on Spyder, which I know is really bad in memory management, so maybe that has played a role: https://stackoverflow.com/questions/57409470/spyder-does-not-realease-memory-for-matplotlib-plots – Vahid S. Bokharaie Nov 06 '19 at 09:11
  • `I have a sorted array of time steps` - that's an additional assumption which can't be assumed from the question. I see where you're going with this, but as such I believe your answer is a possibly good one to a *different* question. If you were to write your own Q&A with the additional criteria, it would probably get a better reception. [Though you *should* mock up example input data, like I have in my Q&A.] – jpp Nov 06 '19 at 09:35
  • I googled for my problem, was led to this Q&A, the solutions did not help, an idea came to my mind which helped me to do what I want to do in hours rather than days, and thought to share the idea with whoever else that might be led to this corner of the virtual world. If it helps somebody else, that's good, but if it is not well-received, I don't give a flying flamingo! – Vahid S. Bokharaie Nov 06 '19 at 11:35