0

I have a dataframe

d = pd.DataFrame({'d': ['d1', 'd1', 'd1', 'd1', 'd2', 'd2', 'd2','d2'], 's':[1, 2, 1, 1, 2, 2, 1, 2], 'v': [100, 5000, 200, 900, 100, 2000, 300, 300]})

     d  s   v
0   d1  1   100
1   d1  2   5000
2   d1  1   200
3   d1  1   900
4   d2  2   100
5   d2  2   2000
6   d2  1   300
7   d2  2   300

And I would like to keep only the rows that have the maximum value of v for each unique combination of columns d and s. So the result would look like

d = pd.DataFrame({'d': ['d1', 'd1', 'd2', 'd2'], 's':[1, 2, 1, 2], 'v': [900, 5000, 300, 2000]})

    d   s   v
0   d1  1   900
1   d1  2   5000
2   d2  1   300
3   d2  2   2000

How can I do this efficiently? In case of a tie, I would prefer to keep the first entry or use another column t (not shown in the example) as a tie breaker if that is easy to implement.

EDIT: For the tie breaker, I would have another column t. The dataframe below (slightly changed from the inital one) now has a tie at d2 for s=1, v=300

d = pd.DataFrame({'d': ['d1', 'd1', 'd1', 'd1', 'd2', 'd2', 'd2','d2'], 't':[1, 3, 5, 6, 2, 3, 8, 4], 's':[1, 2, 1, 1, 2, 2, 1, 1], 'v': [100, 5000, 200, 900, 100, 2000, 300, 300]})

    d   s   t   v
0   d1  1   1   100
1   d1  2   3   5000
2   d1  1   5   200
3   d1  1   6   900
4   d2  2   2   100
5   d2  2   3   2000
6   d2  1   8   300
7   d2  1   4   300

and the result should look like

d = pd.DataFrame({'d': ['d1', 'd1', 'd2', 'd2'], 't': [6, 3, 4, 3], 's':[1, 2, 1, 2], 'v': [900, 5000, 300, 2000]})

    d   s   t   v
0   d1  1   6   900
1   d1  2   3   5000
2   d2  1   4   300
3   d2  2   3   2000

So, we select line 7 over line 6 since t is smaller

chrise
  • 4,039
  • 3
  • 39
  • 74

1 Answers1

0

Use sort_values with drop_duplicates:

df1 = (d.sort_values(['d','s','v'], ascending=(True, True, False))
          .drop_duplicates(['d','s']))
print (df1)
    d  s     v
3  d1  1   900
1  d1  2  5000
6  d2  1   300
5  d2  2  2000

And:

df = (d.sort_values(['d','s','v', 't'], ascending=(True, True, False, True))
      .drop_duplicates(['d','s']))
print (df)
    d  s  t     v
3  d1  1  6   900
1  d1  2  3  5000
7  d2  1  4   300
5  d2  2  3  2000
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252