I am failly new with Pandas and I have a question about how to properly merge rows within my pandas datafame, based on the values of its columns.
I have a dataframe, df1
, which contains the following:
df1 = pd.DataFrame(data=[["shoe", 4, 6, 0.110, 10],
["bracelet", 22, 25, 0.115, 40],
["belt", 2, 5, 0.120, 12],
["socks", 1, 3, 0.422, 8],
["scarf", 10, 12, 0.630, 6],
["skirt", 4, 6, 0.9, 5],
["bag", 9, 13, 1.131, 4],
["watch", 1, 4, 1.8, 5],
["sweater", 4, 5, 5.5, 14],
["jeans", 1, 6, 5.6, 14]],
columns=['item','low','high','time', 'price'])
df1
item low high time price
0 shoe 4 6 0.110 10
1 bracelet 22 25 0.115 40
2 belt 2 5 0.120 12
3 socks 1 3 0.422 8
4 scarf 10 12 0.630 6
5 skirt 4 6 0.9 5
6 bag 9 13 1.131 4
7 watch 1 4 1.8 5
8 sweater 4 5 5.500 14
9 jeans 1 6 5.600 14
I would like to merge these rows based on the values in ['low', 'high', 'time', 'price']
and keep track of item
.
I would like to merge rows where consecutive rows are less than 1
(e.g. seconds) apart based on the time
column. Based on this criteria I want to merge rows 0 -- 7
. But, I only want actually merge them when there is an overlap between the range of ['low', 'high']
. Based on this socks
, belt
, shoe
, skirt
and watch
get merged together because they have ranges (1,3), (2,5), (4,6), (4,6), (1,4) and bracelet
does not get merged with other rows.
Of the rows that are about to be merge together, I would like to keep the item
, time
and price
for the row that had the highest price
. I would also like to update low
and high
such that it fully the covers the entire range of the merged rows. Finally, I would like two new columns time_min
and time_max
that will indicate the total time range of the merge columns. If rows that are to be merged have the same price
, take the one with the lowest time
(there will not be rows with exactly the same time
and price
).
Based on these rules I would expect the following dataframe, df2
:
df2 = pd.DataFrame(data=[["belt", 2, 6, 0.120, 12, 0.110, 0.422],
["bracelet", 22, 25, 0.115, 10, 0.115, 0.115],
["scarf", 9, 13, 0.130, 6, 0.630, 1.131],
["sweater", 1, 6, 5.5, 14, 5.5, 5.6]],
columns=['item','low','high','time', 'price', 'time_min', 'time_max'])
df2
item low high time price time_min time_max
0 belt 2 6 0.120 12 0.110 1.8
1 bracelet 22 25 0.115 10 0.115 0.115
2 scarf 9 13 0.130 6 0.630 1.131
3 sweater 1 6 5.500 14 5.500 5.600
I have looked at the Q&A as provided by: merge rows pandas dataframe based on condition, but in that question they only merge on (in this case) ['low', 'high']
and some information that I would need to retain is lost.