Pandas GroupBy and select rows with the minimum value in a specific column

Question

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.

That is, from this:

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [4, 5, 2, 7, 4, 6],
                   'C': [3, 4, 10, 2, 4, 6]})      
    A   B   C
0   1   4   3
1   1   5   4
2   1   2   10
3   2   7   2
4   2   4   4
5   2   6   6

I would like to get:

    A   B   C
0   1   2   10
1   2   4   4

For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:

a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]

I am sure that there is a much more straightforward way to do this. I have seen many answers here that use MultiIndex, which I would prefer to avoid.

Thank you for your help.

score 77 · Accepted Answer · answered Feb 01 '19 at 00:03

77

I feel like you're overthinking this. Just use groupby and idxmin:

df.loc[df.groupby('A').B.idxmin()]

   A  B   C
2  1  2  10
4  2  4   4

df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)

   A  B   C
0  1  2  10
1  2  4   4

answered Feb 01 '19 at 00:03

cs95

379,657
97
704
746

I am trying this solution but with pandas 1.0.0 I am getting the error: ``Passing list-likes to .loc or [] with any missing labels is no longer supported ``. @cs95 do you have any suggestions how to fix that? – Eve Edomenko Aug 18 '20 at 10:25
4

@cs95 this results in one row per A, what if there are multiple rows with the minimum value for each of the values in A. something like all the students with minimum marks in Science. – A-dude Aug 05 '21 at 11:22

score 13 · Answer 2 · edited Dec 15 '19 at 10:51

13

Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:

df.loc[df.groupby('A')['B val'].idxmin()]

edited Dec 15 '19 at 10:51

Nazim Kerimbekov

4,712
8
34
58

answered Dec 15 '19 at 10:50

Juho

141
1
3

krassowski · Answer 3 · 2022-02-15T16:42:16.460

The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:

data.sort_values('B').groupby('A').apply(DataFrame.head, n=1)

This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).

This approach has additional benefits:

it can be easily expanded to select n rows with smallest values in specific column

it can break ties by providing another column (as a list) to .sort_values(), e.g.:

data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)

As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:

df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)

Nice answer. I would complement saying that I did in this way and seems that works the same way: `data.sort_values('B').groupby('A').head(1)` — igorkf, May 06 '22 at 13:46

score 5 · Answer 4 · answered Oct 03 '22 at 20:01

5

You can sort_values and drop_duplicates:

df.sort_values('B').drop_duplicates('A')

Output:

   A  B   C
2  1  2  10
4  2  4   4

answered Oct 03 '22 at 20:01

Mykola Zotko

15,583
3
71
73

score 3 · Answer 5 · edited Sep 23 '21 at 23:31

I found an answer a little bit more wordy, but a lot more efficient:

This is the example dataset:

data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data

Out:
   A  B   C
0  1  4   3
1  1  5   4
2  1  2  10
3  2  7   2
4  2  4   4
5  2  6   6

First we will get the min values on a Series from a groupby operation:

min_value = data.groupby('A').B.min()
min_value

Out:
A
1    2
2    4
Name: B, dtype: int64

Then, we merge this series result on the original data frame

data = data.merge(min_value, on='A',suffixes=('', '_min'))
data

Out:
   A  B   C  B_min
0  1  4   3      2
1  1  5   4      2
2  1  2  10      2
3  2  7   2      4
4  2  4   4      4
5  2  6   6      4

Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.

data = data[data.B==data.B_min].drop('B_min', axis=1)
data

Out:
   A  B   C
2  1  2  10
4  2  4   4

I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.

very nice solution and easy to follow. – Niccola Tartaglia May 20 '22 at 15:23 — Niccola Tartaglia, May 20 '22 at 15:23

score 1 · Answer 6 · answered Aug 15 '22 at 01:18

The solution is, as written before ;

df.loc[df.groupby('A')['B'].idxmin()]

If the solution but then if you get an error;

"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"

In my case, there were 'NaN' values at column B. So, I used 'dropna()' then it worked.

df.loc[df.groupby('A')['B'].idxmin().dropna()]

score 1 · Answer 7 · answered Aug 17 '22 at 05:07

1

You can also boolean indexing the rows where B column is minimal value

out = df[df['B'] == df.groupby('A')['B'].transform('min')]

print(out)

   A  B   C
2  1  2  10
4  2  4   4

answered Aug 17 '22 at 05:07

Ynjxsjmh

28,441
6
34
52

Thanks, that is what I was looking for – Mikhail Genkin Aug 23 '23 at 21:32

Pandas GroupBy and select rows with the minimum value in a specific column

7 Answers7

Linked

Related