5

I have a Pandas DataFrame that contains a few million rows. I want to select a value from a row based on a condition C.

I have the following code, which works:

all_matches = df.loc[C, "column_name"]
first_match = next(iter(all_matches), 'no match')

The problem is that it is extremely inefficient. I would like to know how to do something similar to df.loc[C, "column_name"], but stopping at the first match.
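
For reference, a minimal runnable reproduction of this pattern (the DataFrame and the condition C below are made up for illustration, not the real data):

import pandas as pd

# toy stand-ins for the real DataFrame and condition C (assumed for illustration only)
df = pd.DataFrame({'col': [1, 2, 3], 'column_name': ['a', 'b', 'va']})
C = df['col'] > 2  # boolean mask playing the role of condition C

all_matches = df.loc[C, "column_name"]
first_match = next(iter(all_matches), 'no match')
print(first_match)  # 'va'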

Nakeuh
  • 1,757
  • 3
  • 26
  • 65

2 Answers

4

If there is always at least one matching value, use Series.iat to quickly get the first one:

df.loc[C, "column_name"].iat[0]

Or:

df.loc[C, "column_name"].values[0]
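
For example, on a made-up toy DataFrame and condition (assumptions, not the asker's data), both accessors return the first filtered value:

import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3], 'column_name': ['a', 'b', 'va']})
C = df['col'] > 2  # hypothetical condition standing in for C

print(df.loc[C, "column_name"].iat[0])     # 'va' - positional access on the filtered Series
print(df.loc[C, "column_name"].values[0])  # 'va' - via the underlying NumPy array
# both raise IndexError if C matches nothing, hence "if there is always at least one match"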

Another solution is to loop over the underlying NumPy arrays and stop at the first match, compiled with numba:

import pandas as pd
from numba import njit

df = pd.DataFrame({'column_name': ['a', 'b', 'va'],
                   'col': [1, 2, 3]})

@njit
def get_first_val_nb(A, B, k):
    # scan A and return the value of B at the first position where A[i] > k
    for i in range(len(A)):
        if A[i] > k:
            return B[i]
    return 'no match'

A = df['col'].values
B = df['column_name'].values  # note: numba's nopython mode cannot use object-dtype arrays,
                              # so depending on the numba version a string column may need
                              # converting first, e.g. with .astype('U')

first_match = get_first_val_nb(A, B, 2)
print(first_match)
va
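
If you prefer to stay in pandas and avoid numba, a rough alternative (a sketch, not part of the original answer, and it still evaluates the whole condition rather than stopping early) is to find the first True with Series.idxmax and guard the no-match case:

import pandas as pd

df = pd.DataFrame({'column_name': ['a', 'b', 'va'],
                   'col': [1, 2, 3]})

mask = df['col'] > 2                  # the condition C
first_match = df.at[mask.idxmax(), 'column_name'] if mask.any() else 'no match'
print(first_match)                    # 'va'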
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

I tested and it appears that at is faster than iat. The others are not suitable because they are either deprecated or they are general vectorized indexers rather than scalar accessors.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 100))

%timeit df.iat[50,50]=50       # ✓ positional scalar access
%timeit df.at[50,50]=50        # ✔ label-based scalar access
%timeit df.set_value(50,50,50) # deprecated, will be removed
%timeit df.iloc[50,50]=50      # general positional indexer
%timeit df.loc[50,50]=50       # general label-based indexer

7.06 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.52 µs ± 64.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.68 µs ± 80.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
98.7 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
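
Tied back to the question, a sketch of using .at (label-based scalar access) to fetch the first match, again on made-up toy data:

import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3], 'column_name': ['a', 'b', 'va']})
C = df['col'] > 2  # hypothetical condition

matching_index = df.index[C]  # index labels where the condition holds
first_match = df.at[matching_index[0], 'column_name'] if len(matching_index) else 'no match'
print(first_match)            # 'va'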
prosti
  • 42,291
  • 14
  • 186
  • 151