Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

Question

I have a pandas dataframe which contains duplicates values according to two columns (A and B):

I want to remove duplicates keeping the row with max value in column C. This would lead to:

I cannot figure out how to do that. Should I use drop_duplicates(), something else?

score 125 · Accepted Answer · edited Jan 30 '19 at 16:33

125

You can do it using group by:

c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum values of C in each group but which is of the same length and with the same index as df. If you haven't used .transform then printing c_maxes might be a good idea to see how it works.

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

Not sure which is more efficient but I guess the first approach as it doesn't involve sorting.

EDIT: From pandas 0.18 up the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

In any case, the groupby solution seems to be significantly more performing:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.max == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop

edited Jan 30 '19 at 16:33

wpercy

9,636
4
33
45

answered Aug 19 '15 at 11:34

JoeCondron

8,546
3
27
28

2

don't forget to assign the new dataframe (in this case to df): `df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'], inplace=True)` – PV8 Nov 20 '19 at 10:25
Adding to @PV8 ' comment, don't use `inplace` at all in the expression as it will not give expected results, assignment is still needed as no inplace work is done. Its better to do operations explicitly to avoid surprises. – รยקคгรђשค Sep 25 '20 at 03:22
don't know what you are talking about, but the inplace command works in that case, check the answer to `drop_duplicates` https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-across-multiple-columns-in-python-pandas – PV8 Sep 25 '20 at 05:19
`take_last=True` is not working, use `keep='last'` instead as per docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html – M H Aug 02 '21 at 03:43
Please see the comments in the EDIT section. – JoeCondron Aug 02 '21 at 08:26
@PV8 If you want to use `inplace` here you would have to do it in two stages, first passed to `sort_values` and then to `drop_duplicates` since both return `None` when `inplace=True`. However, nothing in the question suggest we want to use `inplace` here and doing so loses. Your code will return `None` and will also not affect `df` since `.sort_values` returns a view or copy. – JoeCondron Aug 02 '21 at 08:31

score 69 · Answer 2 · answered May 07 '18 at 08:24

69

You can do this simply by using pandas drop duplicates function

df.drop_duplicates(['A','B'],keep= 'last')

answered May 07 '18 at 08:24

Sudharsan

850
7
6

1

I know this is an old question, but if I have a df with like 15 columns and aren't exactly eager to type out all 11/15 of them, is there a way I could apply `df.drop_duplicates`, but set an exclusion for the first 3 and the very last column? but my df changes size each time (basically im using it to store data in a way easiest to export to csv), so I don't always know the index of the final column. – nos codemos Dec 27 '19 at 10:18
1

Why do you use keep='last'? – Guy s Nov 17 '21 at 11:38

b10n · Answer 3 · 2015-08-19T11:39:42.473

14

I think groupby should work.

df.groupby(['A', 'B']).max()['C']

If you need a dataframe back you can chain the reset index call.

df.groupby(['A', 'B']).max()['C'].reset_index()

edited Aug 19 '15 at 11:39

answered Aug 19 '15 at 11:17

b10n

1,166
9
8

This will just return a Series of the max value of C in each group, indexed by 'A' and 'B'. – JoeCondron Aug 19 '15 at 11:36
edited to include the option to get back a dataframe – b10n Aug 19 '15 at 11:40
Did you read the question? The problem is to remove rows keeping only the rows with the highest value of 'C' for each A-B group. You haven't proposed anything to solve that part of the problem – JoeCondron Aug 19 '15 at 11:42
`df.groupby(['A', 'B']).max()['C'].reset_index()` returns OPs expected output. – b10n Aug 19 '15 at 11:47

score 8 · Answer 4 · answered Dec 05 '17 at 13:47

You can do it with drop_duplicates as you wanted

# initialisation
d = pd.DataFrame({'A' : [1,1,2,3,3], 'B' : [2,2,7,4,4],  'C' : [1,4,1,0,8]})

d = d.sort_values("C", ascending=False)
d = d.drop_duplicates(["A","B"])

If it's important to get the same order

d = d.sort_index()

Remove duplicates from dataframe, based on two columns A,B, keeping row with max value in another column C

4 Answers4

Linked

Related