
I have a df like this:

>>> df1

        col_1   col_2    size_col  other_col
0        aaa     abc       4          zxc
1        aaa     abc       3          xcv
2        aaa     abc       1          cvb
3        bbb     bbc       7          vbn
4        bbb     bbc       3          bnm
5        ccc     cbc       1          asd
6        ddd     dbc       9          sdf
7        ccc     cbc       3          dfg
8        ccc     cbc       1          fgh

and want a df like this:

>>> df2

        col_1   col_2    size_col  other_col
0        aaa     abc       4          zxc
3        bbb     bbc       7          vbn
6        ddd     dbc       9          sdf
7        ccc     cbc       3          dfg

Explanation:
I want to drop all rows where col_1 and col_2 have duplicate values, and retain only the row where 'size_col' is greatest within each duplicate bunch. So, from the above example, for the rows where col_1 and col_2 have aaa and abc, I need to retain the row where size_col has the biggest value. Or, put another way, I need to group by the col_1 and col_2 columns, then for each group retain only the row where size_col has the biggest value for that group.

How do I do this efficiently for a df with around 5 million rows and 7 columns?
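For reference, a minimal reproduction of the example frame (values copied from the tables above):

import pandas as pd

df1 = pd.DataFrame({
    'col_1': ['aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'ccc', 'ddd', 'ccc', 'ccc'],
    'col_2': ['abc', 'abc', 'abc', 'bbc', 'bbc', 'cbc', 'dbc', 'cbc', 'cbc'],
    'size_col': [4, 3, 1, 7, 3, 1, 9, 3, 1],
    'other_col': ['zxc', 'xcv', 'cvb', 'vbn', 'bnm', 'asd', 'sdf', 'dfg', 'fgh'],
})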

Naveen Reddy Marthala

1 Answer


Use DataFrameGroupBy.idxmax to get the index of the row with the maximal size_col in each group, then select those rows with DataFrame.loc:

# index of the row with the largest size_col per (col_1, col_2) group
df2 = df1.loc[df1.groupby(['col_1', 'col_2'])['size_col'].idxmax()]
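
If idxmax turns out to be slow at ~5 million rows, a possible alternative (a sketch, assuming size_col has no NaN values) is to sort once by size_col and keep the first row of each (col_1, col_2) pair; kind='stable' preserves the original order of tied rows, so ties are broken the same way idxmax breaks them:

# sort descending, keep the first (largest) row per pair,
# then restore the original row order
df2 = (df1.sort_values('size_col', ascending=False, kind='stable')
          .drop_duplicates(['col_1', 'col_2'])
          .sort_index())

Both versions produce the df2 shown in the question (rows 0, 3, 6, 7).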
jezrael