Get first row of dataframe in Python Pandas based on criteria

Question

Let's say that I have a DataFrame like this one

import pandas as pd

df = pd.DataFrame([
            [1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]
        ], columns=['A', 'B', 'C'])

>> df
   A  B  C
0  1  2  1
1  1  3  2
2  4  6  3
3  4  3  4
4  5  4  5

The original DataFrame is more complicated with more columns and rows.

I want to get the first row that fulfills some criteria. Examples:

Get first row where A > 3 (returns row 2)
Get first row where A > 4 AND B > 3 (returns row 4)
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)

But, if there isn't any row that fulfills the specific criteria, then I want to get the first one after I just sort it descending by A (or other cases by B, C etc).

Get first row where A > 6 (returns row 4 by ordering it by A desc and get the first one)

I was able to do it by iterating on the DataFrame (I know that craps :P). So, what would be a more pythonic way to solve it?

[Boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)? — Kartik, Nov 17 '16 at 16:31
So, @Kartik I will use Boolean Indexing and IF the returned dataframe is empty, then I sort it manually and get the first row? This is the most appropriate way? — Tasos, Nov 17 '16 at 16:35
Boolean indexing probably is more than you want, since it has to run through the whole DataFrame. If the first row is the one you want, you should be able to find out immediately, regardless of the size of the DataFrame. — BallpointBen, Nov 17 '16 at 16:36
Is there a specific way the criteria relate to which column you want to sort on? Or is it different? If you have a list of criteria, you can use @Boud's answer to select the rows, and if the result has no rows, then you can use `argmax` to return the largest row. — Kartik, Nov 17 '16 at 16:41

score 106 · Accepted Answer · answered Nov 17 '16 at 16:42

This tutorial is a very good one for pandas slicing. Make sure you check it out. Onto some snippets... To slice a dataframe with a condition, you use this format:

>>> df[condition]

This will return a slice of your dataframe which you can index using iloc. Here are your examples:

Get first row where A > 3 (returns row 2)

>>> df[df.A > 3].iloc[0]
A    4
B    6
C    3
Name: 2, dtype: int64

If what you actually want is the row number, rather than using iloc, it would be df[df.A > 3].index[0].

Get first row where A > 4 AND B > 3:

>>> df[(df.A > 4) & (df.B > 3)].iloc[0]
A    5
B    4
C    5
Name: 4, dtype: int64

Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)

>>> df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].iloc[0]
A    4
B    6
C    3
Name: 2, dtype: int64

Now, with your last case we can write a function that handles the default case of returning the descending-sorted frame:

>>> def series_or_default(X, condition, default_col, ascending=False):
...     sliced = X[condition]
...     if sliced.shape[0] == 0:
...         return X.sort_values(default_col, ascending=ascending).iloc[0]
...     return sliced.iloc[0]
>>> 
>>> series_or_default(df, df.A > 6, 'A')
A    5
B    4
C    5
Name: 4, dtype: int64

As expected, it returns row 4.

In case your dataframe is very large, is this not wasteful? You're computing an entire series just to pull out its first element. — BallpointBen, Feb 19 '19 at 23:23

score 24 · Answer 2 · answered Nov 17 '16 at 16:33

24

For existing matches, use query:

df.query(' A > 3' ).head(1)
Out[33]: 
   A  B  C
2  4  6  3

df.query(' A > 4 and B > 3' ).head(1)
Out[34]: 
   A  B  C
4  5  4  5

df.query(' A > 3 and (B > 3 or C > 2)' ).head(1)
Out[35]: 
   A  B  C
2  4  6  3

answered Nov 17 '16 at 16:33

Zeugma

31,231
9
69
81

score 7 · Answer 3 · edited Nov 18 '16 at 16:43

7

you can take care of the first 3 items with slicing and head:

df[df.A>=4].head(1)
df[(df.A>=4)&(df.B>=3)].head(1)
df[(df.A>=4)&((df.B>=3) * (df.C>=2))].head(1)

The condition in case nothing comes back you can handle with a try or an if...

try:
    output = df[df.A>=6].head(1)
    assert len(output) == 1
except: 
    output = df.sort_values('A',ascending=False).head(1)

edited Nov 18 '16 at 16:43

TayTay

6,882
4
44
65

answered Nov 17 '16 at 16:37

PabTorre

2,878
21
30

ZeFeng Zhu · Answer 4 · 2020-09-10T15:03:01.543

For the point that 'returns the value as soon as you find the first row/record that meets the requirements and NOT iterating other rows', the following code would work:

def pd_iter_func(df):
    for row in df.itertuples():
        # Define your criteria here
        if row.A > 4 and row.B > 3:
            return row

It is more efficient than Boolean Indexing when it comes to a large dataframe.

To make the function above more applicable, one can implements lambda functions:

def pd_iter_func(df: DataFrame, criteria: Callable[[NamedTuple], bool]) -> Optional[NamedTuple]:
    for row in df.itertuples():
        if criteria(row):
            return row

pd_iter_func(df, lambda row: row.A > 4 and row.B > 3)

As mentioned in the answer to the 'mirror' question, pandas.Series.idxmax would also be a nice choice.

def pd_idxmax_func(df, mask):
    return df.loc[mask.idxmax()]

pd_idxmax_func(df, (df.A > 4) & (df.B > 3))

I like itertuples(). But with idxmax(), isn't the max eval'd on each row before idxmax() is run? — Zachary Ryan Smith, Dec 23 '20 at 12:15

Get first row of dataframe in Python Pandas based on criteria

4 Answers4

Linked

Related