
I have a DataFrame that contains duplicate IDs. I want to remove the duplicates based on the Count column: for each ID, keep only the row with the highest count. If the duplicate rows for an ID all have the same value in the Count column, just keep the first one.

DataFrame:

ID   Status     Count
12   Match       2
12   NotMatch    1
13   MaybeMatch  3
14   NotMatch    2
15   NotMatch    3
16   Match       2
16   NotMatch    1
17   Match       1
17   MaybeMatch  2
18   Match       1
18   MaybeMatch  1
18   NotMatch    1

Result DataFrame:

ID   Status     Count
12   Match       2
13   MaybeMatch  3
14   NotMatch    2
15   NotMatch    3
16   Match       2
17   MaybeMatch  2
18   Match       1

Thanks.

David

2 Answers


You can sort and drop duplicates:

res = df.sort_values('Count', ascending=False)\
        .drop_duplicates('ID')

If you wish to maintain ordering and include duplicate maximums:

g = df.groupby('ID')['Count'].transform('max')
res = df[df['Count'] == g]
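For reference, here is a quick end-to-end check of the sort-and-drop approach against the sample data. The DataFrame construction below is just an illustrative reconstruction of the question's data, and a stable sort kind is used so that rows tying on Count keep their original order (e.g. ID 18 resolves to the first row):

import pandas as pd

# Sample data from the question, rebuilt here for illustration.
df = pd.DataFrame({
    'ID': [12, 12, 13, 14, 15, 16, 16, 17, 17, 18, 18, 18],
    'Status': ['Match', 'NotMatch', 'MaybeMatch', 'NotMatch', 'NotMatch',
               'Match', 'NotMatch', 'Match', 'MaybeMatch',
               'Match', 'MaybeMatch', 'NotMatch'],
    'Count': [2, 1, 3, 2, 3, 2, 1, 1, 2, 1, 1, 1],
})

# Sort descending by Count, then keep the first row per ID.
# kind='mergesort' is stable, so rows that tie on Count stay in their
# original order and drop_duplicates keeps the first one (ID 18 -> Match).
res = (df.sort_values('Count', ascending=False, kind='mergesort')
         .drop_duplicates('ID')
         .sort_values('ID'))
print(res)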
jpp

Sort by Count first, then drop duplicates, keeping the first record for each ID:

df = df.sort_values('Count', ascending=False)
df = df.drop_duplicates(subset=['ID'], keep='first')
Ankur Sinha
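As an aside (not part of either answer above), a sketch of an alternative that avoids sorting the whole frame, assuming ties should resolve to the first row:

# Take the index of the first maximum Count within each ID group;
# idxmax returns the first occurrence, so ties keep the first row.
res = df.loc[df.groupby('ID')['Count'].idxmax()]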