How to use apply function in pandas to return multiple rows?

Question

Context

I have a dataframe that contains transcripts. Each row in the df has a unique ID, transcript line, and timestamp, and each ID can have multiple correspondences within the day (or span of days).

Example Code Below!

What I have:

#What I have starting out with. Df is ordered by CustomerID and Timestamp
pd.DataFrame({'AgentID': 0, 'CustomerID': 1, 'Date': ['2018-01-21', '2018-01-21', '2018-01-22', '2018-01-22'], 'Timestamp': ['2018-01-21 16:28:54', '2018-01-21 16:48:54', '2018-01-22 12:18:54', '2018-01-22 12:22:54'], 'Transcript_Line':['How can I help you?', 'I need help with this pandas problem...', 'Did you get that problem resolved?', 'Nope I still suck at pandas']})

What I need:

#This is the final result
 pd.DataFrame({'AgentID': 0, 'CustomerID': 1, 'Date': ['2018-01-21', '2018-01-22'], 'Transcript_Line': ['How can I help you?\nI need help with this pandas problem...', 'Did you get that problem resolved?\nNope I still suck at pandas']})

I need to organize and combine all transcripts (strings in each row) that correspond to the same day (in order).

This is what I have tried so far The issue is here:

def concatConvos(x):

    if len(set(x.Date)) == 1:
        return pd.Series({'Email' : x['CustomerID'].values[0], 
                        'Date': x['Date'].values[0],
                    'Conversation' : '\n'.join(x['Transcript_Line'])})
    else:  
        rows = []
        for date in set(x.Date):
            rows.append(pd.Series({'Email': x['CustomerID'].values[0],
                                 'Date': date,
                                 'Conversation': '\n'.join(x[x.Date == date].Transcript_Line)}))
        return tuple(rows)

data3 = data2.groupby('CustomerID').apply(concatConvos)

I am able to get this to work for cases where the customer only has 1 date of correspondence (meaning he did not reach out multiple times, the first case).

If I try to handle more cases than 1 then I end up with attribute errors likely because the function is returning multiple series objects.

Is there an easier way to go about this?

would it be possible for you to share a sample piece of your dataframe that shows your issue and your expected output? There is a lot of text and it may be difficult for some (also me) to follow along — MattR, Feb 21 '18 at 16:35
Yes...give me a second... There is PII info I need to filter out. — codebrotherone, Feb 21 '18 at 16:47
try not using a picture and paste the text instead. Read [this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and it may help :) — MattR, Feb 21 '18 at 16:55
That was a great edit, continue to post questions like these! If you post it this way the first time, you are more likely to get solid answers :) — MattR, Feb 21 '18 at 18:28

MattR · Answer 1 · 2018-02-21T18:26:32.520

This is not the prettiest solution, or the most efficient, but I've used something like this in the past. I'm sure there may be a more efficient solution out there instead of using loops. I'll give you the raw code and then break it down step by step:

transcript_join = df.groupby(['CustomerID', 'Date']).apply(lambda f: f['Transcript_Line'].values.tolist()).to_dict()

for x in transcript_join.keys():
    df.loc[(df['CustomerID']==x[0]) & (df['Date'] == x[1]), 'Combine'] = '\n'.join(transcript_join.get(x))

df.drop_duplicates(df.iloc[:,[0,1,2,5]])

# output below
    AgentID CustomerID  Date    Timestamp   Transcript_Line Combine
0   0   1   2018-01-21  2018-01-21 16:28:54 How can I help you? How can I help you?\nI need help with this pan...
2   0   1   2018-01-22  2018-01-22 12:18:54 Did you get that problem resolved?  Did you get that problem resolved?\nNope I sti...

First I create a dictionary of all of the responses with the variable transcript_join. The key is the Customer ID and then the Date. The Value is a list of the transcripts.

I then loop through the keys and get the location of where the Customer ID and the Date is the same in the dictionary, and use .join to combine the transcripts together in a new column.

Lastly, I drop the duplicates since there will now be duplicates becasue each Customer ID and Date pair will contain the same Combine column. I use iloc to get rid of the columns that are not needed in the output such as the original Transcript column as well as Timestamp

Honestly... If someone posts another solution to this, I'd greatly appreciate it because it would make one of my processes I work on much faster — MattR, Feb 21 '18 at 18:27
I feel like the groupby method should work, and it should be able to be manipulated via apply to get our results — codebrotherone, Feb 21 '18 at 19:10
unfortunately I am still working on implementing it. The dataset is at about half a million records so I am trying to parse everything in chunks — codebrotherone, Feb 21 '18 at 19:32

ALollz · Answer 2 · 2018-03-14T21:07:41.397

You should be able to accomplish this with groupby. Here's your original DataFrame. I just named it df for convenience.

df = pd.DataFrame({'AgentID': 0, 'CustomerID': 1, 'Date': ['2018-01-21', '2018-01-21', '2018-01-22', '2018-01-22'], 'Timestamp': ['2018-01-21 16:28:54', '2018-01-21 16:48:54', '2018-01-22 12:18:54', '2018-01-22 12:22:54'], 'Transcript_Line':['How can I help you?', 'I need help with this pandas problem...', 'Did you get that problem resolved?', 'Nope I still suck at pandas']})

I'm a little unclear about whether you need to sort on both AgentID and CustomerID or just one or the other, but hopefully you can see how to modify it.

The initial sort guarantees that the Transcript_Line will be placed in order. groupby then finds the set of all claims for the same AgentID and CustomerID on the same day. The as_index=False, gives you the correct formatting of columns in the output. The output you want is combining the transcript lines, which you can accomplish with sum.

df.sort_values(by=['AgentID', 'CustomerID', 
    'Timestamp']).groupby(['AgentID', 'CustomerID', 
    'Date'], as_index=False)['Transcript_Line'].sum()

If you really need the '\n' characters between them, then you can get around this by first adding them to every single transcript line, doing the same groupby as above, and then removing the character at the end of the combined string.

df['Transcript_Line'] = df['Transcript_Line'] + '\n'

grouped = df.sort_values(by=['AgentID', 'CustomerID', 
    'Timestamp']).groupby(['AgentID', 'CustomerID', 
    'Date'], as_index=False)['Transcript_Line'].sum()

grouped['Transcript_Line'] = grouped['Transcript_Line'].apply(lambda 
    x: x[:-1])

resulting grouped dataframe

How to use apply function in pandas to return multiple rows?

2 Answers2