
I am fairly new to Pandas and I have a question about how to properly merge rows within my pandas dataframe, based on the values of its columns.

I have a dataframe, df1, which contains the following:

import pandas as pd

df1 = pd.DataFrame(data=[["shoe", 4, 6, 0.110, 10],
                         ["bracelet", 22, 25, 0.115, 40],
                         ["belt", 2, 5, 0.120, 12],
                         ["socks", 1, 3, 0.422, 8],
                         ["scarf", 10, 12, 0.630, 6],
                         ["skirt", 4, 6, 0.9, 5],
                         ["bag", 9, 13, 1.131, 4],
                         ["watch", 1, 4, 1.8, 5],
                         ["sweater", 4, 5, 5.5, 14],
                         ["jeans", 1, 6, 5.6, 14]],
                   columns=['item', 'low', 'high', 'time', 'price'])
df1
    item         low    high    time    price
0   shoe         4      6       0.110   10
1   bracelet     22     25      0.115   40
2   belt         2      5       0.120   12
3   socks        1      3       0.422   8
4   scarf        10     12      0.630   6
5   skirt        4      6       0.9     5
6   bag          9      13      1.131   4
7   watch        1      4       1.8     5
8   sweater      4      5       5.500   14
9   jeans        1      6       5.600   14

I would like to merge these rows based on the values in ['low', 'high', 'time', 'price'] and keep track of item. I would like to merge rows where consecutive rows are less than 1 (e.g. seconds) apart based on the time column. Based on this criterion I would merge rows 0 to 7. However, I only want to actually merge rows when there is an overlap between their ['low', 'high'] ranges. Based on this, socks, belt, shoe, skirt and watch get merged together because their ranges (1, 3), (2, 5), (4, 6), (4, 6), (1, 4) overlap, while bracelet does not get merged with any other row.

Of the rows that are about to be merged together, I would like to keep the item, time and price of the row that had the highest price. I would also like to update low and high so that they fully cover the entire range of the merged rows. Finally, I would like two new columns, time_min and time_max, that indicate the total time range of the merged rows. If rows that are to be merged have the same price, take the one with the lowest time (there will not be rows with exactly the same time and price).

Based on these rules I would expect the following dataframe, df2:

df2 = pd.DataFrame(data=[["belt", 1, 6, 0.120, 12, 0.110, 1.8],
                         ["bracelet", 22, 25, 0.115, 40, 0.115, 0.115],
                         ["scarf", 9, 13, 0.630, 6, 0.630, 1.131],
                         ["sweater", 1, 6, 5.5, 14, 5.5, 5.6]],
                   columns=['item', 'low', 'high', 'time', 'price', 'time_min', 'time_max'])
df2
    item        low high    time    price   time_min    time_max
0   belt        1   6       0.120   12      0.110       1.800
1   bracelet    22  25      0.115   40      0.115       0.115
2   scarf       9   13      0.630   6       0.630       1.131
3   sweater     1   6       5.500   14      5.500       5.600

I have looked at the Q&A merge rows pandas dataframe based on condition, but in that question the rows are only merged on (in this case) ['low', 'high'], and some information that I would need to retain is lost.

scientist
  • What does _where consecutive rows are less than 1 (e.g. seconds) apart_ exactly mean? E.g. you could have consecutive `time`s like `0.5, 1.4, 2.3`: Would they meet the requirement? – Timus Nov 28 '21 at 11:21
  • Hi Timus, yes that would meet the requirement. I have edited the example dataframe which now better illustrates such a scenario. – scientist Nov 28 '21 at 11:43

1 Answer

Here's an attempt:

def consolidate(sdf):
    # Sort the (low, high) intervals so overlapping intervals can be grouped in one linear pass
    low_high_sorted = sdf[['low', 'high']].sort_values(['low', 'high'])
    grouping = pd.Series(0, index=low_high_sorted.index)
    group, group_max = 0, low_high_sorted.high.iat[0]
    for i, low, high in low_high_sorted.iloc[1:, :].itertuples():
        if low <= group_max:
            # Interval overlaps the current group -> extend the group's upper bound
            group_max = max(group_max, high)
        else:
            # Gap found -> start a new connection-group
            group += 1
            group_max = high
        grouping.at[i] = group

    # Aggregate each overlap group into one row; i_max is the index of the row with the highest price
    return sdf.groupby(grouping).agg(
        i_max=('price', lambda x: x.idxmax()),
        low=('low', 'min'), high=('high', 'max'), price=('price', 'max'),
        time_min=('time', 'min'), time_max=('time', 'max')
    )

df2 = (
    df1.groupby(((df1.time - df1.time.shift()) >= 1.).cumsum())  # groups of rows less than 1 apart in time
       .apply(consolidate)                                       # merge overlapping (low, high) ranges per time group
       .merge(df1.item, left_on='i_max', right_index=True, how='left')  # attach the item of the highest-price row
       .drop(columns='i_max')
       .reset_index(drop=True)
)

Result:

   low  high  price  time_min  time_max      item
0    1     6     12     0.110     1.800      belt
1    9    13      6     0.630     1.131     scarf
2   22    25     40     0.115     0.115  bracelet
3    1     6     14     5.500     5.600   sweater

Unfortunately, I wasn't able to build the grouping series in consolidate with built-in Pandas methods.


1. Step: Grouping df1 to get the groups determined by the time column: checking whether the difference between time and the time in the row before is greater than or equal to 1., and then taking the cumulative sum over the resulting column (False/True = 0/1):

    time  time_diff_ge_1  cumsum
0  0.110           False       0
1  0.115           False       0
2  0.120           False       0
3  0.422           False       0
4  0.630           False       0
5  0.900           False       0
6  1.131           False       0
7  1.800           False       0
8  5.500            True       1
9  5.600           False       1

(I should have used df1.time.diff() instead of df1.time - df1.time.shift().)
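
As a small check, here's a sketch (assuming df1 from the question and import pandas as pd) that reproduces the table above using the shorter .diff() form:

diff_ge_1 = df1.time.diff().ge(1.)   # NaN in the first row compares as False
step1 = pd.DataFrame({'time': df1.time,
                      'time_diff_ge_1': diff_ge_1,
                      'cumsum': diff_ge_1.cumsum()})
print(step1)   # matches the table above; the cumsum column is the grouping key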

2. Step: .apply-ing consolidate over the first grouping. Inside of it, a second grouping over the connected intervals. To find the connections: sorting the low and high columns to allow for a linear processing afterwards:

   low  high
3    1     3
7    1     4
2    2     5
0    4     6
5    4     6
6    9    13
4   10    12
1   22    25

(Actually, sorting only over low would be enough: sdf[['low', 'high']].sort_values('low').) For finding the connection-groups it's now enough to check if low is less than or equal to the current group's high-maximum. If that's not the case, a new connection-group starts.

   low  high  grouping
3    1     3         0
7    1     4         0
2    2     5         0
0    4     6         0
5    4     6         0
6    9    13         1
4   10    12         1
1   22    25         2

Using a pd.Series as grouping has a nice effect: the indices will be aligned, which essentially makes it a mapping index -> grouping value, so the order of the indices doesn't matter. See here:

... If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method).
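
A quick sketch (assuming df1 from the question) to illustrate that alignment: the grouping Series below is written in the sorted index order from the table above, and groupby still assigns every row of the first time-group to the right connection-group:

sub = df1.iloc[:8]   # the first time-group from step 1
s = pd.Series({3: 0, 7: 0, 2: 0, 0: 0, 5: 0, 6: 1, 4: 1, 1: 2})   # index in sorted order
sub.groupby(s).size()
# 0    5
# 1    2
# 2    1
# dtype: int64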

3. Step: Aggregating each of the resulting groups into one line with .agg. The agg-logic:

new_column_name=(used_column, aggregating_function)

Small detour to get the item that corresponds to the highest price: here only the corresponding index i_max is fetched via .idxmax. (operator.methodcaller would be an alternative to the lambda-function: methodcaller('idxmax').)
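
For example, the same named aggregation could be written with methodcaller instead of the lambda (a sketch, equivalent to the agg call in consolidate):

from operator import methodcaller

agg_spec = dict(
    i_max=('price', methodcaller('idxmax')),
    low=('low', 'min'), high=('high', 'max'), price=('price', 'max'),
    time_min=('time', 'min'), time_max=('time', 'max')
)
# inside consolidate: sdf.groupby(grouping).agg(**agg_spec)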

4. Step: .merge-ing the result with df1.item along the i_max-index to get the items that correspond to the i_maxes, and dropping the i_max column afterwards.
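
To make the lookup explicit, a tiny sketch of just this step (the i_max values are the ones this example produces): merging with df1.item on the right index simply looks each i_max up in df1's row index:

pd.DataFrame({'i_max': [2, 4, 1, 8]}).merge(df1.item, left_on='i_max', right_index=True, how='left')
#    i_max      item
# 0      2      belt
# 1      4     scarf
# 2      1  bracelet
# 3      8   sweater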

Timus
  • Thank you very much Timus! I think this is an excellent attempt. – scientist Nov 30 '21 at 12:44
  • @scientist Thanks! I'm still thinking about the grouping ... :) – Timus Nov 30 '21 at 15:32
  • Hi Timus, if you have the time and are willing to help me, could you explain the code you posted or add some comments to it. I think I know what you are doing, but would be good to make sure I for 100% know what you are doing. – scientist Dec 02 '21 at 09:53
  • @scientist Will do, this evening. (Should have done that in the first place, it's bad style to answer without explanations.) – Timus Dec 02 '21 at 10:01
  • @scientist I've added some explanations. Please let me know if it isn't clear enough and/or you find an error! – Timus Dec 02 '21 at 20:10
  • Hi Timus, thank you very much for the explanation. I totally understand what happens now. So far I have not found any errors using my actual data. Thanks again! – scientist Dec 06 '21 at 11:40