How can I workout the difference between two values in a column while remaining in the bounds of another column?

Question

I have a dataframe and am trying to calculate the time difference between two different topics while remaining within a call and not spilling over into a new call (i.e while ensuring it's not working out the time difference between topics in different calls). Where the interaction_id is a seperate call

This is an example Dataframe

df = pd.DataFrame([[1, 2, 'Cost'], [1, 5.72, NaN], [1, 8.83, 'Billing'], [1, 12.86, NaN], [2, 2, 'Cost'], [2, 6.75, NaN], [2, 8.54, NaN], [3, 1.5, 'Payments'],[3, 3.65, 'Products']], columns=['interaction_id', 'start_time', 'topic'])

      interaction_id    start_time     topic 
           1               2           Cost
           1              5.72          NaN
           1              8.83         Billing
           1              12.86         NaN
           2               2            Cost
           2              6.75          NaN
           2              8.54          NaN
           3              1.5          Payments
           3              3.65         Products

An this is the Desired Output

df2 = pd.DataFrame([[1, 2, 'Cost',6.83], [1, 5.72, NaN, NaN], [1, 8.83, 'Billing',4.03], [1, 12.86, NaN,NaN], [2, 2, 'Cost',6.54], [2, 6.75, NaN, NaN], [2, 8.54, NaN, NaN], [3, 1.5, 'Payments', 2.15],[3, 3.65, 'Products','...']], columns=['interaction_id', 'start_time', 'topic','topic_length])

       interaction_id    start_time     topic     topic_length

           1               2           Cost           6.83
           1              5.72          NaN           NaN
           1              8.83         Billing        4.03
           1              12.86         NaN           NaN
           2               2            Cost          6.54
           2              6.75          NaN           NaN
           2              8.54          NaN           NaN
           3              1.5          Payments       2.15
           3              3.65         Products       ....

I hope that makes sense

Welcome to SO. Please note that images of code/date are generally discouraged. Also, please read [this guide](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and format your question accordingly. — Quang Hoang, Feb 10 '21 at 17:32

Adam Zeldin · Accepted Answer · 2021-02-10T21:47:15.620

Can you try the following approach?

I'm applying a function to each call (interaction) and then assigning a unique number (ngroup) each topic for each call. Then I'm assigning the end of the call its own number (-1). Then I use diff to calculate the topic length.

import pandas as pd
import numpy as np
from numpy import nan as NaN
df = pd.DataFrame([[1, 2, 'Cost'], [1, 5.72, NaN], [1, 8.83, 'Billing'], [1, 12.86, NaN], [2, 2, 'Cost'], [2, 6.75, NaN], [2, 8.54, NaN], [3, 1.5, 'Payments'],[3, 3.65, 'Products']], columns=['interaction_id', 'start_time', 'topic'])
def func(df):
    ngroup_df = pd.DataFrame({"topic":df.ffill()['topic'].drop_duplicates().to_list(),"ngroup":[i for i in range(len(df.ffill()['topic'].drop_duplicates().to_list()))][::-1]})
    df = df.ffill().merge(ngroup_df)
    df.loc[df.index.max(), 'ngroup'] = -1
    length_df = df[['start_time','ngroup']].groupby('ngroup').min().diff().dropna().rename({'start_time':'length'}, axis = 1).reset_index()
    length_df['length'] = length_df['length'].abs()
    df.loc[df.index.max(), 'ngroup'] = 0
    return df.merge(length_df, how = 'left')
>>> print(df.groupby(['interaction_id']).apply(func).reset_index(drop = True))
   interaction_id  start_time     topic  ngroup  length
0               1        2.00      Cost       1    6.83
1               1        5.72      Cost       1    6.83
2               1        8.83   Billing       0    4.03
3               1       12.86   Billing       0    4.03
4               2        2.00      Cost       0    6.54
5               2        6.75      Cost       0    6.54
6               2        8.54      Cost       0    6.54
7               3        1.50  Payments       1    2.15
8               3        3.65  Products       0     NaN

Hi Adam, I keep getting the following error 'cannot reindex from a duplicate axis' — CurtisRob, Feb 10 '21 at 21:07
Hi @CurtisRob, I just edited the code to include everything from import through the result. Does this resolve the import error? These are the calculations you indicated as your desired results. You can tweak the code if you don't want length repeated. — Adam Zeldin, Feb 10 '21 at 21:48

How can I workout the difference between two values in a column while remaining in the bounds of another column?

1 Answers1