9
            is_avail   valu data_source
2015-08-07     False  0.282    source_a
2015-08-07     False  0.582    source_b
2015-08-23     False  0.296    source_a
2015-09-08     False  0.433    source_a
2015-10-01      True  0.169    source_b

In the dataframe above, I want to remove the duplicate rows (i.e. row where the index is repeated) by retaining the row with a higher value in the valu column.

I can remove rows with duplicate indexes like this:

df = df[~df.index.duplicated()]. But how to remove based on condition specified above?

user308827
  • 21,227
  • 87
  • 254
  • 417
  • this might help: http://stackoverflow.com/questions/13035764/remove-rows-with-duplicate-indices-pandas-dataframe-and-timeseries – Paul H May 05 '17 at 22:21

2 Answers2

8

You can use groupby on index after sorting the df by valu.

df.sort_values(by='valu', ascending=False).groupby(level=0).first()
Out[1277]: 
           is_avail   valu data_source
2015-08-07    False  0.582    source_b
2015-08-23    False  0.296    source_a
2015-09-08    False  0.433    source_a
2015-10-01     True  0.169    source_b
Allen Qin
  • 19,507
  • 8
  • 51
  • 67
  • 1
    This is the better solution! However, you don't need `ascending=False`. I'd change it to `df.sort_values('valu').groupby(level=0).tail(1)` – piRSquared May 05 '17 at 22:36
5

Using drop_duplicates with keep='last'

df.rename_axis('date').reset_index() \
    .sort_values(['date', 'valu']) \
    .drop_duplicates('date', keep='last') \
    .set_index('date').rename_axis(df.index.name)

           is_avail   valu data_source
2015-08-07    False  0.582    source_b
2015-08-23    False  0.296    source_a
2015-09-08    False  0.433    source_a
2015-10-01     True  0.169    source_b
piRSquared
  • 285,575
  • 57
  • 475
  • 624