This question is slightly more complicated than Remove duplicate rows in pandas dataframe based on condition:
Instead of one 'valu' column, I now have two columns 'valu1', 'valu2':
t           valu1  valu2
2015-08-01      1     10
2015-08-01      2     11
2015-08-01      3     12
2015-09-31      4     15
2015-10-31      5     13
In the dataframe above, I want to remove the duplicate rows (i.e. rows where the column 't' is repeated), keeping the highest value of the valu1 column and the lowest value of the valu2 column for each date.
Expected outcome:
t           valu1  valu2
2015-08-01      3     10
2015-09-31      4     15
2015-10-31      5     13
The df.sort_values() plus drop_duplicates with keep='last' approach mentioned in the linked question obviously doesn't work here: for 2015-08-01 the expected row combines valu1 from one row with valu2 from another, and drop_duplicates can only ever keep a whole existing row.
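To make that failure concrete, here is a small sketch reconstructing the frame from the question (the dates are kept as strings, since '2015-09-31' is not a parseable calendar date):

```python
import pandas as pd

# Reconstruct the example frame from the question
df = pd.DataFrame({
    't': ['2015-08-01', '2015-08-01', '2015-08-01', '2015-09-31', '2015-10-31'],
    'valu1': [1, 2, 3, 4, 5],
    'valu2': [10, 11, 12, 15, 13],
})

# Sorting and keeping one row per date can only return a row that already
# exists in the frame; it cannot mix columns from different rows.
kept = df.sort_values('valu1').drop_duplicates('t', keep='last')
print(kept)
# For 2015-08-01 this keeps the row (valu1=3, valu2=12), not the wanted (3, 10).
```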
Something I can think of now is:
# Let's call the dataframe df
dups = df.loc[df['t'].duplicated(), 't'].drop_duplicates()  # dates that appear more than once
for d in dups:
    max_v1 = df.loc[df['t'] == d, 'valu1'].max()  # find the max of valu1 on day d
    min_v2 = df.loc[df['t'] == d, 'valu2'].min()  # find the min of valu2 on day d
    df.loc[df['t'] == d, 'valu1'] = max_v1  # .loc assignment; chained indexing like
    df.loc[df['t'] == d, 'valu2'] = min_v2  # df[mask]['valu1'] = ... writes to a copy
df = df.drop_duplicates('t')  # now all rows of a date are identical, keep one
I think this should work, but it seems unsophisticated, especially since I actually need to do this on a large dataset. Any idea how I should approach this problem?
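One vectorized alternative (a sketch, not necessarily the only idiomatic answer) is to let groupby aggregate each column differently in a single pass, which removes the Python-level loop entirely:

```python
import pandas as pd

# Reconstruct the example frame from the question ('t' kept as strings)
df = pd.DataFrame({
    't': ['2015-08-01', '2015-08-01', '2015-08-01', '2015-09-31', '2015-10-31'],
    'valu1': [1, 2, 3, 4, 5],
    'valu2': [10, 11, 12, 15, 13],
})

# One pass over the data: per date, take the max of valu1 and the min of valu2
out = df.groupby('t', as_index=False).agg({'valu1': 'max', 'valu2': 'min'})
print(out)
```

Because the grouping happens once instead of re-scanning the frame for every duplicated date, this should scale far better on a large dataset.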