5

I have a Pandas DataFrame that contains a few million rows. I want to select a value from a row based on a condition C.

I have the following code, which works:

all_matches = df.loc[C, "column_name"]
first_match = next(iter(all_matches), 'no match')

The problem is that it is extremely inefficient. I would like to know how to do something similar to df.loc[C, "column_name"], but stopping at the first match.
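
For reference, a minimal runnable reproduction of this pattern (the DataFrame and the condition C below are made up for illustration, not the real data):

import pandas as pd

# toy stand-ins for the real DataFrame and condition C (assumed for illustration only)
df = pd.DataFrame({'col': [1, 2, 3], 'column_name': ['a', 'b', 'va']})
C = df['col'] > 2  # boolean mask playing the role of condition C

all_matches = df.loc[C, "column_name"]
first_match = next(iter(all_matches), 'no match')
print(first_match)  # 'va'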

Nakeuh
  • 1,757
  • 3
  • 26
  • 65

2 Answers

4

If there is always at least one matching value, use Series.iat to quickly get the first one:

df.loc[C, "column_name"].iat[0]

Or:

df.loc[C, "column_name"].values[0]
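
For example, on a made-up toy DataFrame and condition (assumptions, not the asker's data), both accessors return the first filtered value:

import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3], 'column_name': ['a', 'b', 'va']})
C = df['col'] > 2  # hypothetical condition standing in for C

print(df.loc[C, "column_name"].iat[0])     # 'va' - positional access on the filtered Series
print(df.loc[C, "column_name"].values[0])  # 'va' - via the underlying NumPy array
# both raise IndexError if C matches nothing, hence "if there is always at least one match"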

Another solution is to loop over the underlying NumPy arrays and stop at the first match, compiled with numba:

import pandas as pd
from numba import njit

df = pd.DataFrame({'column_name': ['a', 'b', 'va'],
                   'col': [1, 2, 3]})

@njit
def get_first_val_nb(A, B, k):
    # scan A and return the value of B at the first position where A[i] > k
    for i in range(len(A)):
        if A[i] > k:
            return B[i]
    return 'no match'

A = df['col'].values
B = df['column_name'].values  # note: numba's nopython mode cannot use object-dtype arrays,
                              # so depending on the numba version a string column may need
                              # converting first, e.g. with .astype('U')

first_match = get_first_val_nb(A, B, 2)
print(first_match)
va
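
If you prefer to stay in pandas and avoid numba, a rough alternative (a sketch, not part of the original answer, and it still evaluates the whole condition rather than stopping early) is to find the first True with Series.idxmax and guard the no-match case:

import pandas as pd

df = pd.DataFrame({'column_name': ['a', 'b', 'va'],
                   'col': [1, 2, 3]})

mask = df['col'] > 2                  # the condition C
first_match = df.at[mask.idxmax(), 'column_name'] if mask.any() else 'no match'
print(first_match)                    # 'va'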
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

I tested and it appears that at is faster than iat. The others are not suitable because they are either deprecated or they are general vectorized indexers rather than scalar accessors.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 100))

%timeit df.iat[50,50]=50       # ✓ positional scalar access
%timeit df.at[50,50]=50        # ✔ label-based scalar access
%timeit df.set_value(50,50,50) # deprecated, will be removed
%timeit df.iloc[50,50]=50      # general positional indexer
%timeit df.loc[50,50]=50       # general label-based indexer

7.06 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
5.52 µs ± 64.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.68 µs ± 80.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
98.7 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
109 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
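
Tied back to the question, a sketch of using .at (label-based scalar access) to fetch the first match, again on made-up toy data:

import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3], 'column_name': ['a', 'b', 'va']})
C = df['col'] > 2  # hypothetical condition

matching_index = df.index[C]  # index labels where the condition holds
first_match = df.at[matching_index[0], 'column_name'] if len(matching_index) else 'no match'
print(first_match)            # 'va'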
prosti
  • 42,291
  • 14
  • 186
  • 151