103

I have a pandas dataframe which contains duplicates values according to two columns (A and B):

A B C
1 2 1
1 2 4
2 7 1
3 4 0
3 4 8

I want to remove duplicates keeping the row with max value in column C. This would lead to:

A B C
1 2 4
2 7 1
3 4 8

I cannot figure out how to do that. Should I use drop_duplicates(), something else?

smci
  • 32,567
  • 20
  • 113
  • 146
Elsalex
  • 1,177
  • 2
  • 8
  • 9

4 Answers4

125

You can do it using group by:

c_maxes = df.groupby(['A', 'B']).C.transform(max)
df = df.loc[df.C == c_maxes]

c_maxes is a Series of the maximum values of C in each group but which is of the same length and with the same index as df. If you haven't used .transform then printing c_maxes might be a good idea to see how it works.

Another approach using drop_duplicates would be

df.sort('C').drop_duplicates(subset=['A', 'B'], take_last=True)

Not sure which is more efficient but I guess the first approach as it doesn't involve sorting.

EDIT: From pandas 0.18 up the second solution would be

df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')

or, alternatively,

df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'])

In any case, the groupby solution seems to be significantly more performing:

%timeit -n 10 df.loc[df.groupby(['A', 'B']).C.max == df.C]
10 loops, best of 3: 25.7 ms per loop

%timeit -n 10 df.sort_values('C').drop_duplicates(subset=['A', 'B'], keep='last')
10 loops, best of 3: 101 ms per loop
wpercy
  • 9,636
  • 4
  • 33
  • 45
JoeCondron
  • 8,546
  • 3
  • 27
  • 28
  • 2
    don't forget to assign the new dataframe (in this case to df): `df.sort_values('C', ascending=False).drop_duplicates(subset=['A', 'B'], inplace=True)` – PV8 Nov 20 '19 at 10:25
  • Adding to @PV8 ' comment, don't use `inplace` at all in the expression as it will not give expected results, assignment is still needed as no inplace work is done. Its better to do operations explicitly to avoid surprises. – รยקคгรђשค Sep 25 '20 at 03:22
  • don't know what you are talking about, but the inplace command works in that case, check the answer to `drop_duplicates` https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-across-multiple-columns-in-python-pandas – PV8 Sep 25 '20 at 05:19
  • `take_last=True` is not working, use `keep='last'` instead as per docs: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html – M H Aug 02 '21 at 03:43
  • Please see the comments in the EDIT section. – JoeCondron Aug 02 '21 at 08:26
  • @PV8 If you want to use `inplace` here you would have to do it in two stages, first passed to `sort_values` and then to `drop_duplicates` since both return `None` when `inplace=True`. However, nothing in the question suggest we want to use `inplace` here and doing so loses. Your code will return `None` and will also not affect `df` since `.sort_values` returns a view or copy. – JoeCondron Aug 02 '21 at 08:31
69

You can do this simply by using pandas drop duplicates function

df.drop_duplicates(['A','B'],keep= 'last')
Sudharsan
  • 850
  • 7
  • 6
  • 1
    I know this is an old question, but if I have a df with like 15 columns and aren't exactly eager to type out all 11/15 of them, is there a way I could apply `df.drop_duplicates`, but set an exclusion for the first 3 and the very last column? but my df changes size each time (basically im using it to store data in a way easiest to export to csv), so I don't always know the index of the final column. – nos codemos Dec 27 '19 at 10:18
  • 1
    Why do you use keep='last'? – Guy s Nov 17 '21 at 11:38
14

I think groupby should work.

df.groupby(['A', 'B']).max()['C']

If you need a dataframe back you can chain the reset index call.

df.groupby(['A', 'B']).max()['C'].reset_index()
b10n
  • 1,166
  • 9
  • 8
  • This will just return a Series of the max value of C in each group, indexed by 'A' and 'B'. – JoeCondron Aug 19 '15 at 11:36
  • edited to include the option to get back a dataframe – b10n Aug 19 '15 at 11:40
  • Did you read the question? The problem is to remove rows keeping only the rows with the highest value of 'C' for each A-B group. You haven't proposed anything to solve that part of the problem – JoeCondron Aug 19 '15 at 11:42
  • `df.groupby(['A', 'B']).max()['C'].reset_index()` returns OPs expected output. – b10n Aug 19 '15 at 11:47
8

You can do it with drop_duplicates as you wanted

# initialisation
d = pd.DataFrame({'A' : [1,1,2,3,3], 'B' : [2,2,7,4,4],  'C' : [1,4,1,0,8]})

d = d.sort_values("C", ascending=False)
d = d.drop_duplicates(["A","B"])

If it's important to get the same order

d = d.sort_index()
AlexT
  • 81
  • 1
  • 4