1

I have a dataframe and am trying to calculate the time difference between two different topics while remaining within a call and not spilling over into a new call (i.e while ensuring it's not working out the time difference between topics in different calls). Where the interaction_id is a seperate call

This is an example Dataframe

df = pd.DataFrame([[1, 2, 'Cost'], [1, 5.72, NaN], [1, 8.83, 'Billing'], [1, 12.86, NaN], [2, 2, 'Cost'], [2, 6.75, NaN], [2, 8.54, NaN], [3, 1.5, 'Payments'],[3, 3.65, 'Products']], columns=['interaction_id', 'start_time', 'topic'])

      interaction_id    start_time     topic 
           1               2           Cost
           1              5.72          NaN
           1              8.83         Billing
           1              12.86         NaN
           2               2            Cost
           2              6.75          NaN
           2              8.54          NaN
           3              1.5          Payments
           3              3.65         Products

An this is the Desired Output

df2 = pd.DataFrame([[1, 2, 'Cost',6.83], [1, 5.72, NaN, NaN], [1, 8.83, 'Billing',4.03], [1, 12.86, NaN,NaN], [2, 2, 'Cost',6.54], [2, 6.75, NaN, NaN], [2, 8.54, NaN, NaN], [3, 1.5, 'Payments', 2.15],[3, 3.65, 'Products','...']], columns=['interaction_id', 'start_time', 'topic','topic_length])

       interaction_id    start_time     topic     topic_length

           1               2           Cost           6.83
           1              5.72          NaN           NaN
           1              8.83         Billing        4.03
           1              12.86         NaN           NaN
           2               2            Cost          6.54
           2              6.75          NaN           NaN
           2              8.54          NaN           NaN
           3              1.5          Payments       2.15
           3              3.65         Products       ....

I hope that makes sense

CurtisRob
  • 35
  • 4
  • 1
    Welcome to SO. Please note that images of code/date are generally discouraged. Also, please read [this guide](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and format your question accordingly. – Quang Hoang Feb 10 '21 at 17:32

1 Answers1

1

Can you try the following approach?

I'm applying a function to each call (interaction) and then assigning a unique number (ngroup) each topic for each call. Then I'm assigning the end of the call its own number (-1). Then I use diff to calculate the topic length.

import pandas as pd
import numpy as np
from numpy import nan as NaN
df = pd.DataFrame([[1, 2, 'Cost'], [1, 5.72, NaN], [1, 8.83, 'Billing'], [1, 12.86, NaN], [2, 2, 'Cost'], [2, 6.75, NaN], [2, 8.54, NaN], [3, 1.5, 'Payments'],[3, 3.65, 'Products']], columns=['interaction_id', 'start_time', 'topic'])
def func(df):
    ngroup_df = pd.DataFrame({"topic":df.ffill()['topic'].drop_duplicates().to_list(),"ngroup":[i for i in range(len(df.ffill()['topic'].drop_duplicates().to_list()))][::-1]})
    df = df.ffill().merge(ngroup_df)
    df.loc[df.index.max(), 'ngroup'] = -1
    length_df = df[['start_time','ngroup']].groupby('ngroup').min().diff().dropna().rename({'start_time':'length'}, axis = 1).reset_index()
    length_df['length'] = length_df['length'].abs()
    df.loc[df.index.max(), 'ngroup'] = 0
    return df.merge(length_df, how = 'left')
>>> print(df.groupby(['interaction_id']).apply(func).reset_index(drop = True))
   interaction_id  start_time     topic  ngroup  length
0               1        2.00      Cost       1    6.83
1               1        5.72      Cost       1    6.83
2               1        8.83   Billing       0    4.03
3               1       12.86   Billing       0    4.03
4               2        2.00      Cost       0    6.54
5               2        6.75      Cost       0    6.54
6               2        8.54      Cost       0    6.54
7               3        1.50  Payments       1    2.15
8               3        3.65  Products       0     NaN
Adam Zeldin
  • 898
  • 4
  • 6
  • Hi Adam, I keep getting the following error 'cannot reindex from a duplicate axis' – CurtisRob Feb 10 '21 at 21:07
  • Hi @CurtisRob, I just edited the code to include everything from import through the result. Does this resolve the import error? These are the calculations you indicated as your desired results. You can tweak the code if you don't want length repeated. – Adam Zeldin Feb 10 '21 at 21:48
  • Perfect! Thanks so much @Adam! I really appreciate it – CurtisRob Feb 10 '21 at 22:20