80

Let's say that I have a DataFrame like this one

import pandas as pd

df = pd.DataFrame([
            [1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]
        ], columns=['A', 'B', 'C'])

>> df
   A  B  C
0  1  2  1
1  1  3  2
2  4  6  3
3  4  3  4
4  5  4  5

The original DataFrame is more complicated with more columns and rows.

I want to get the first row that fulfills some criteria. Examples:

  1. Get first row where A > 3 (returns row 2)
  2. Get first row where A > 4 AND B > 3 (returns row 4)
  3. Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)

But, if there isn't any row that fulfills the specific criteria, then I want to get the first one after I just sort it descending by A (or other cases by B, C etc).

  1. Get first row where A > 6 (returns row 4 by ordering it by A desc and get the first one)

I was able to do it by iterating on the DataFrame (I know that craps :P). So, what would be a more pythonic way to solve it?

vvv444
  • 2,764
  • 1
  • 14
  • 25
Tasos
  • 7,325
  • 18
  • 83
  • 176
  • [Boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)? – Kartik Nov 17 '16 at 16:31
  • So, @Kartik I will use Boolean Indexing and IF the returned dataframe is empty, then I sort it manually and get the first row? This is the most appropriate way? – Tasos Nov 17 '16 at 16:35
  • 2
    Boolean indexing probably is more than you want, since it has to run through the whole DataFrame. If the first row is the one you want, you should be able to find out immediately, regardless of the size of the DataFrame. – BallpointBen Nov 17 '16 at 16:36
  • 1
    Is there a specific way the criteria relate to which column you want to sort on? Or is it different? If you have a list of criteria, you can use @Boud's answer to select the rows, and if the result has no rows, then you can use `argmax` to return the largest row. – Kartik Nov 17 '16 at 16:41

4 Answers4

106

This tutorial is a very good one for pandas slicing. Make sure you check it out. Onto some snippets... To slice a dataframe with a condition, you use this format:

>>> df[condition]

This will return a slice of your dataframe which you can index using iloc. Here are your examples:

  1. Get first row where A > 3 (returns row 2)

    >>> df[df.A > 3].iloc[0]
    A    4
    B    6
    C    3
    Name: 2, dtype: int64
    

If what you actually want is the row number, rather than using iloc, it would be df[df.A > 3].index[0].

  1. Get first row where A > 4 AND B > 3:

    >>> df[(df.A > 4) & (df.B > 3)].iloc[0]
    A    5
    B    4
    C    5
    Name: 4, dtype: int64
    
  2. Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)

    >>> df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].iloc[0]
    A    4
    B    6
    C    3
    Name: 2, dtype: int64
    

Now, with your last case we can write a function that handles the default case of returning the descending-sorted frame:

>>> def series_or_default(X, condition, default_col, ascending=False):
...     sliced = X[condition]
...     if sliced.shape[0] == 0:
...         return X.sort_values(default_col, ascending=ascending).iloc[0]
...     return sliced.iloc[0]
>>> 
>>> series_or_default(df, df.A > 6, 'A')
A    5
B    4
C    5
Name: 4, dtype: int64

As expected, it returns row 4.

TayTay
  • 6,882
  • 4
  • 44
  • 65
  • 42
    In case your dataframe is very large, is this not wasteful? You're computing an entire series just to pull out its first element. – BallpointBen Feb 19 '19 at 23:23
24

For existing matches, use query:

df.query(' A > 3' ).head(1)
Out[33]: 
   A  B  C
2  4  6  3

df.query(' A > 4 and B > 3' ).head(1)
Out[34]: 
   A  B  C
4  5  4  5

df.query(' A > 3 and (B > 3 or C > 2)' ).head(1)
Out[35]: 
   A  B  C
2  4  6  3
Zeugma
  • 31,231
  • 9
  • 69
  • 81
7

you can take care of the first 3 items with slicing and head:

  1. df[df.A>=4].head(1)
  2. df[(df.A>=4)&(df.B>=3)].head(1)
  3. df[(df.A>=4)&((df.B>=3) * (df.C>=2))].head(1)

The condition in case nothing comes back you can handle with a try or an if...

try:
    output = df[df.A>=6].head(1)
    assert len(output) == 1
except: 
    output = df.sort_values('A',ascending=False).head(1)
TayTay
  • 6,882
  • 4
  • 44
  • 65
PabTorre
  • 2,878
  • 21
  • 30
7

For the point that 'returns the value as soon as you find the first row/record that meets the requirements and NOT iterating other rows', the following code would work:

def pd_iter_func(df):
    for row in df.itertuples():
        # Define your criteria here
        if row.A > 4 and row.B > 3:
            return row

It is more efficient than Boolean Indexing when it comes to a large dataframe.

To make the function above more applicable, one can implements lambda functions:

def pd_iter_func(df: DataFrame, criteria: Callable[[NamedTuple], bool]) -> Optional[NamedTuple]:
    for row in df.itertuples():
        if criteria(row):
            return row

pd_iter_func(df, lambda row: row.A > 4 and row.B > 3)

As mentioned in the answer to the 'mirror' question, pandas.Series.idxmax would also be a nice choice.

def pd_idxmax_func(df, mask):
    return df.loc[mask.idxmax()]

pd_idxmax_func(df, (df.A > 4) & (df.B > 3))
ZeFeng Zhu
  • 124
  • 1
  • 5