
I have a DataFrame that contains duplicate IDs. I want to remove the duplicates based on the Count column: for each ID, keep only the row with the highest count. If the duplicate rows for an ID all have the same value in the Count column, just keep the first one.

DataFrame:

ID   Status     Count
12   Match       2
12   NotMatch    1
13   MaybeMatch  3
14   NotMatch    2
15   NotMatch    3
16   Match       2
16   NotMatch    1
17   Match       1
17   MaybeMatch  2
18   Match       1
18   MaybeMatch  1
18   NotMatch    1

Result DataFrame:

ID   Status     Count
12   Match       2
13   MaybeMatch  3
14   NotMatch    2
15   NotMatch    3
16   Match       2
17   MaybeMatch  2
18   Match       1

Thanks.

David

2 Answers


You can sort and drop duplicates:

res = df.sort_values('Count', ascending=False)\
        .drop_duplicates('ID')

If you wish to maintain ordering and include duplicate maximums:

g = df.groupby('ID')['Count'].transform('max')
res = df[df['Count'] == g]
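For reference, here is a quick end-to-end check of the sort-and-drop approach against the sample data. The DataFrame construction below is just an illustrative reconstruction of the question's data, and a stable sort kind is used so that rows tying on Count keep their original order (e.g. ID 18 resolves to the first row):

import pandas as pd

# Sample data from the question, rebuilt here for illustration.
df = pd.DataFrame({
    'ID': [12, 12, 13, 14, 15, 16, 16, 17, 17, 18, 18, 18],
    'Status': ['Match', 'NotMatch', 'MaybeMatch', 'NotMatch', 'NotMatch',
               'Match', 'NotMatch', 'Match', 'MaybeMatch',
               'Match', 'MaybeMatch', 'NotMatch'],
    'Count': [2, 1, 3, 2, 3, 2, 1, 1, 2, 1, 1, 1],
})

# Sort descending by Count, then keep the first row per ID.
# kind='mergesort' is stable, so rows that tie on Count stay in their
# original order and drop_duplicates keeps the first one (ID 18 -> Match).
res = (df.sort_values('Count', ascending=False, kind='mergesort')
         .drop_duplicates('ID')
         .sort_values('ID'))
print(res)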
jpp

Sort by Count first, then drop duplicates, keeping the first record for each ID:

df = df.sort_values('Count', ascending=False)
df = df.drop_duplicates(subset=['ID'], keep='first')
Ankur Sinha
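As an aside (not part of either answer above), a sketch of an alternative that avoids sorting the whole frame, assuming ties should resolve to the first row:

# Take the index of the first maximum Count within each ID group;
# idxmax returns the first occurrence, so ties keep the first row.
res = df.loc[df.groupby('ID')['Count'].idxmax()]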