3

This question is slightly more complicated than Remove duplicate rows in pandas dataframe based on condition:

Instead of one 'valu' column, I now have two columns 'valu1', 'valu2':

         t    valu1    valu2
2015-08-01        1       10
2015-08-01        2       11
2015-08-01        3       12
2015-09-31        4       15
2015-10-31        5       13

In the dataframe above, I want to remove the duplicate rows (i.e. row where the column 't' is repeated) by retaining the row with a higher value in the valu1 column and a lower value in the value2 column.

Expected outcome:

         t    valu1    valu2
2015-08-01        3       10
2015-09-31        4       15
2015-10-31        5       13

The df.sort_values() and drop_duplicates with keep='last' mentioned in the linked question obviously don't work.

Something I can think of now is:

#Let's call the dataframe df
dups = df[df['t'].duplicated()]['t'].drop_duplicates()  #get duplicated dates
for d in dups:
    max_v1 = df[df['t'] == d]['valu1'].max()  #find the max of valu1 on day d
    min_v2 = df[df['t'] == d]['valu2'].min()  #find the min of valu2 on day d
    df[df['t'] == d]['valu1'] = max_v1        #set valu1 of day d to max_v1
    df[df['t'] == d]['valu2'] = min_v2        #set valu2 of day d to min_v2

df = df[~df.index.duplicated()]               #drop everything duplicated

I think this should work, but it really seems unsophisticated, especially I actually need to do this for a large dataset. Any idea of how I should approach this problem?

Barbaros Özhan
  • 59,113
  • 10
  • 31
  • 55
LSF
  • 97
  • 6

1 Answers1

5

I think you are looking for

df.groupby('t').agg({'valu1':'max','valu2':'min'}).reset_index()
            t  valu1  valu2
0  2015-08-01      3     10
1  2015-09-31      4     15
2  2015-10-31      5     13
BENY
  • 317,841
  • 20
  • 164
  • 234