62

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.

That is, from this:

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})      
    A   B   C
0   1   4   3
1   1   5   4
2   1   2   10
3   2   7   2
4   2   4   4
5   2   6   6  

I would like to get:

    A   B   C
0   1   2   10
1   2   4   4

For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:

a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]

I am sure that there is a much more straightforward way to do this. I have seen many answers here that use MultiIndex, which I would prefer to avoid.

Thank you for your help.

Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
Wendy
  • 791
  • 1
  • 5
  • 8

7 Answers7

77

I feel like you're overthinking this. Just use groupby and idxmin:

df.loc[df.groupby('A').B.idxmin()]

   A  B   C
2  1  2  10
4  2  4   4

df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)

   A  B   C
0  1  2  10
1  2  4   4
cs95
  • 379,657
  • 97
  • 704
  • 746
  • I am trying this solution but with pandas 1.0.0 I am getting the error: ``Passing list-likes to .loc or [] with any missing labels is no longer supported ``. @cs95 do you have any suggestions how to fix that? – Eve Edomenko Aug 18 '20 at 10:25
  • 4
    @cs95 this results in one row per A, what if there are multiple rows with the minimum value for each of the values in A. something like all the students with minimum marks in Science. – A-dude Aug 05 '21 at 11:22
13

Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:

df.loc[df.groupby('A')['B val'].idxmin()]
Nazim Kerimbekov
  • 4,712
  • 8
  • 34
  • 58
Juho
  • 141
  • 1
  • 3
5

The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:

data.sort_values('B').groupby('A').apply(DataFrame.head, n=1)

This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).

This approach has additional benefits:

  • it can be easily expanded to select n rows with smallest values in specific column
  • it can break ties by providing another column (as a list) to .sort_values(), e.g.:
    data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)
    

As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:

df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)
krassowski
  • 13,598
  • 4
  • 60
  • 92
  • 2
    Nice answer. I would complement saying that I did in this way and seems that works the same way: `data.sort_values('B').groupby('A').head(1)` – igorkf May 06 '22 at 13:46
5

You can sort_values and drop_duplicates:

df.sort_values('B').drop_duplicates('A')

Output:

   A  B   C
2  1  2  10
4  2  4   4
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
3

I found an answer a little bit more wordy, but a lot more efficient:

This is the example dataset:

data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data

Out:
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6 

First we will get the min values on a Series from a groupby operation:

min_value = data.groupby('A').B.min()
min_value

Out:
A
1    2
2    4
Name: B, dtype: int64

Then, we merge this series result on the original data frame

data = data.merge(min_value, on='A',suffixes=('', '_min'))
data

Out:
   A  B   C  B_min
0  1  4   3      2
1  1  5   4      2
2  1  2  10      2
3  2  7   2      4
4  2  4   4      4
5  2  6   6      4

Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.

data = data[data.B==data.B_min].drop('B_min', axis=1)
data

Out:
   A  B   C
2  1  2  10
4  2  4   4

I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.

Dharman
  • 30,962
  • 25
  • 85
  • 135
1

The solution is, as written before ;

df.loc[df.groupby('A')['B'].idxmin()]

If the solution but then if you get an error;

"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

In my case, there were 'NaN' values at column B. So, I used 'dropna()' then it worked.

df.loc[df.groupby('A')['B'].idxmin().dropna()]
yalin
  • 53
  • 2
  • 5
1

You can also boolean indexing the rows where B column is minimal value

out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)

   A  B   C
2  1  2  10
4  2  4   4
Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52