2

Context

I have a dataframe that contains transcripts. Each row in the df has a unique ID, transcript line, and timestamp, and each ID can have multiple correspondences within the day (or span of days).


Example Code Below!


What I have:

#What I have starting out with. Df is ordered by CustomerID and Timestamp
pd.DataFrame({'AgentID': 0, 'CustomerID': 1, 'Date': ['2018-01-21', '2018-01-21', '2018-01-22', '2018-01-22'], 'Timestamp': ['2018-01-21 16:28:54', '2018-01-21 16:48:54', '2018-01-22 12:18:54', '2018-01-22 12:22:54'], 'Transcript_Line':['How can I help you?', 'I need help with this pandas problem...', 'Did you get that problem resolved?', 'Nope I still suck at pandas']})

enter image description here


What I need:

#This is the final result
 pd.DataFrame({'AgentID': 0, 'CustomerID': 1, 'Date': ['2018-01-21', '2018-01-22'], 'Transcript_Line': ['How can I help you?\nI need help with this pandas problem...', 'Did you get that problem resolved?\nNope I still suck at pandas']})

enter image description here

I need to organize and combine all transcripts (strings in each row) that correspond to the same day (in order).


This is what I have tried so far The issue is here:

def concatConvos(x):

    if len(set(x.Date)) == 1:
        return pd.Series({'Email' : x['CustomerID'].values[0], 
                        'Date': x['Date'].values[0],
                    'Conversation' : '\n'.join(x['Transcript_Line'])})
    else:  
        rows = []
        for date in set(x.Date):
            rows.append(pd.Series({'Email': x['CustomerID'].values[0],
                                 'Date': date,
                                 'Conversation': '\n'.join(x[x.Date == date].Transcript_Line)}))
        return tuple(rows)

data3 = data2.groupby('CustomerID').apply(concatConvos)

I am able to get this to work for cases where the customer only has 1 date of correspondence (meaning he did not reach out multiple times, the first case).

If I try to handle more cases than 1 then I end up with attribute errors likely because the function is returning multiple series objects.

Is there an easier way to go about this?

codebrotherone
  • 541
  • 6
  • 22

2 Answers2

1

This is not the prettiest solution, or the most efficient, but I've used something like this in the past. I'm sure there may be a more efficient solution out there instead of using loops. I'll give you the raw code and then break it down step by step:

transcript_join = df.groupby(['CustomerID', 'Date']).apply(lambda f: f['Transcript_Line'].values.tolist()).to_dict()

for x in transcript_join.keys():
    df.loc[(df['CustomerID']==x[0]) & (df['Date'] == x[1]), 'Combine'] = '\n'.join(transcript_join.get(x))

df.drop_duplicates(df.iloc[:,[0,1,2,5]])

# output below
    AgentID CustomerID  Date    Timestamp   Transcript_Line Combine
0   0   1   2018-01-21  2018-01-21 16:28:54 How can I help you? How can I help you?\nI need help with this pan...
2   0   1   2018-01-22  2018-01-22 12:18:54 Did you get that problem resolved?  Did you get that problem resolved?\nNope I sti...

First I create a dictionary of all of the responses with the variable transcript_join. The key is the Customer ID and then the Date. The Value is a list of the transcripts.

I then loop through the keys and get the location of where the Customer ID and the Date is the same in the dictionary, and use .join to combine the transcripts together in a new column.

Lastly, I drop the duplicates since there will now be duplicates becasue each Customer ID and Date pair will contain the same Combine column. I use iloc to get rid of the columns that are not needed in the output such as the original Transcript column as well as Timestamp

MattR
  • 4,887
  • 9
  • 40
  • 67
1

You should be able to accomplish this with groupby. Here's your original DataFrame. I just named it df for convenience.

df = pd.DataFrame({'AgentID': 0, 'CustomerID': 1, 'Date': ['2018-01-21', '2018-01-21', '2018-01-22', '2018-01-22'], 'Timestamp': ['2018-01-21 16:28:54', '2018-01-21 16:48:54', '2018-01-22 12:18:54', '2018-01-22 12:22:54'], 'Transcript_Line':['How can I help you?', 'I need help with this pandas problem...', 'Did you get that problem resolved?', 'Nope I still suck at pandas']})

I'm a little unclear about whether you need to sort on both AgentID and CustomerID or just one or the other, but hopefully you can see how to modify it.

The initial sort guarantees that the Transcript_Line will be placed in order. groupby then finds the set of all claims for the same AgentID and CustomerID on the same day. The as_index=False, gives you the correct formatting of columns in the output. The output you want is combining the transcript lines, which you can accomplish with sum.

df.sort_values(by=['AgentID', 'CustomerID', 
    'Timestamp']).groupby(['AgentID', 'CustomerID', 
    'Date'], as_index=False)['Transcript_Line'].sum()

If you really need the '\n' characters between them, then you can get around this by first adding them to every single transcript line, doing the same groupby as above, and then removing the character at the end of the combined string.

df['Transcript_Line'] = df['Transcript_Line'] + '\n'

grouped = df.sort_values(by=['AgentID', 'CustomerID', 
    'Timestamp']).groupby(['AgentID', 'CustomerID', 
    'Date'], as_index=False)['Transcript_Line'].sum()

grouped['Transcript_Line'] = grouped['Transcript_Line'].apply(lambda 
    x: x[:-1])

resulting grouped dataframe

ALollz
  • 57,915
  • 7
  • 66
  • 89