I have a df like this:
>>> df1
  col_1 col_2  size_col other_col
0   aaa   abc         4       zxc
1   aaa   abc         3       xcv
2   aaa   abc         1       cvb
3   bbb   bbc         7       vbn
4   bbb   bbc         3       bnm
5   ccc   cbc         1       asd
6   ddd   dbc         9       sdf
7   ccc   cbc         3       dfg
8   ccc   cbc         1       fgh
and want a df like this:
>>> df2
  col_1 col_2  size_col other_col
0   aaa   abc         4       zxc
3   bbb   bbc         7       vbn
6   ddd   dbc         9       sdf
7   ccc   cbc         3       dfg
Explanation:
I want to drop all the rows where col_1 and col_2 have duplicate values, and retain only the row where size_col is greatest within each duplicate bunch. So, in the example above, among the rows where col_1 and col_2 are aaa and abc, I need to keep the row where size_col has the biggest value. Put another way, I need to group by the col_1 and col_2 columns, then for each group retain only the row where size_col has the biggest value for that group.
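A minimal sketch of that grouping logic, assuming plain pandas: group on the two key columns, take the index label of the row with the maximal size_col per group via idxmax, then select those rows.

```python
import pandas as pd

# Reconstruct the example frame from the question.
df1 = pd.DataFrame({
    "col_1": ["aaa", "aaa", "aaa", "bbb", "bbb", "ccc", "ddd", "ccc", "ccc"],
    "col_2": ["abc", "abc", "abc", "bbc", "bbc", "cbc", "dbc", "cbc", "cbc"],
    "size_col": [4, 3, 1, 7, 3, 1, 9, 3, 1],
    "other_col": ["zxc", "xcv", "cvb", "vbn", "bnm", "asd", "sdf", "dfg", "fgh"],
})

# For each (col_1, col_2) group, idxmax returns the index label of the
# first row whose size_col is maximal within that group.
idx = df1.groupby(["col_1", "col_2"])["size_col"].idxmax()

# Select those rows; sort_index restores the original row order.
df2 = df1.loc[idx].sort_index()
print(df2)
```

Note that idxmax keeps the first maximal row when a group has ties, and it needs a unique index to be unambiguous; whether this is fast enough at 5 million rows would need measuring.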
How do I do this efficiently for a df with around 5 million rows and 7 columns?