
I have a df like this:

>>> df1

        col_1   col_2    size_col  other_col
0        aaa     abc       4          zxc
1        aaa     abc       3          xcv
2        aaa     abc       1          cvb
3        bbb     bbc       7          vbn
4        bbb     bbc       3          bnm
5        ccc     cbc       1          asd
6        ddd     dbc       9          sdf
7        ccc     cbc       3          dfg
8        ccc     cbc       1          fgh

and want a df like this:

>>> df2

        col_1   col_2    size_col  other_col
0        aaa     abc       4          zxc
3        bbb     bbc       7          vbn
6        ddd     dbc       9          sdf
7        ccc     cbc       3          dfg

Explanation:
I want to drop all rows where col_1 and col_2 have duplicate values, and retain only the row where 'size_col' is greatest within each duplicate bunch. So, from the above example, for the rows where col_1 and col_2 have aaa and abc, I need to retain the row where size_col has the biggest value. Or, put another way, I need to group by the col_1 and col_2 columns, then for each group retain only the row where size_col has the biggest value for that group.

How do I do this efficiently for a df with around 5 million rows and 7 columns?
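For reference, a minimal reproduction of the example frame (values copied from the tables above):

import pandas as pd

df1 = pd.DataFrame({
    'col_1': ['aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'ccc', 'ddd', 'ccc', 'ccc'],
    'col_2': ['abc', 'abc', 'abc', 'bbc', 'bbc', 'cbc', 'dbc', 'cbc', 'cbc'],
    'size_col': [4, 3, 1, 7, 3, 1, 9, 3, 1],
    'other_col': ['zxc', 'xcv', 'cvb', 'vbn', 'bnm', 'asd', 'sdf', 'dfg', 'fgh'],
})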

Naveen Reddy Marthala

1 Answer


Use DataFrameGroupBy.idxmax to get the index of the row with the maximal size_col in each group, then select those rows with DataFrame.loc:

# index of the row with the largest size_col per (col_1, col_2) group
df2 = df1.loc[df1.groupby(['col_1', 'col_2'])['size_col'].idxmax()]
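
If idxmax turns out to be slow at ~5 million rows, a possible alternative (a sketch, assuming size_col has no NaN values) is to sort once by size_col and keep the first row of each (col_1, col_2) pair; kind='stable' preserves the original order of tied rows, so ties are broken the same way idxmax breaks them:

# sort descending, keep the first (largest) row per pair,
# then restore the original row order
df2 = (df1.sort_values('size_col', ascending=False, kind='stable')
          .drop_duplicates(['col_1', 'col_2'])
          .sort_index())

Both versions produce the df2 shown in the question (rows 0, 3, 6, 7).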
jezrael