-1

I have a Pandas DataFrame shown below consisting of three columns.

import pandas as pd
data = [[1, "User1", "Hello."], [1, "User1", "How are you?"], [1, "User2", "I'm fine."], [2, "User1", "Nice to meet you."], [2, "User2", "Hello."], [2, "User2", "I'm happy."], [2, "User2", "Goodbye."], [3, "User2", "Hello."]]
df = pd.DataFrame(data, columns=['Conversation', 'User', 'Text'])
   Conversation   User              Text
0             1  User1             Hello.
1             1  User1      How are you?
2             1  User2         I'm fine.
3             2  User1  Nice to meet you.
4             2  User2             Hello.
5             2  User2         I'm happy.
6             2  User2           Goodbye.
7             3  User2             Hello.

I would like to merge the Text of groups of consecutive Users, but not over conversation boundaries. If in a Conversation a User has several consecutive rows, I would like to merge these rows into one row by combining the Text with whitespace. When a new Conversation starts, it should not be combined. For the example, the result should look as follows:

   Conversation   User              Text
0             1  User1             Hello. How are you?
2             1  User2         I'm fine.
3             2  User1  Nice to meet you.
4             2  User2             Hello. I'm happy. Goodbye.
7             3  User2             Hello.

How can this be achieved in an efficient way (I have a big DataFrame)?

wjandrea
  • 28,235
  • 9
  • 60
  • 81
BlackHawk
  • 719
  • 1
  • 6
  • 18
  • Very similar question: [Conditionally merge consecutive rows of a pandas dataframe](/q/63853639/4518341) – wjandrea Dec 31 '22 at 02:05
  • What have you already tried? If you know how to use `.groupby` and `.agg` already, then the only thing you'd be missing would be grouping consecutive `User`s, and there's an existing question about basically that: [How to groupby consecutive values in pandas DataFrame](/q/40802800/4518341). Does that answer your question? – wjandrea Dec 31 '22 at 02:08
  • BTW, please read [ask] for tips like how to write a good title. – wjandrea Dec 31 '22 at 02:09
  • That might be a bad example since there are no non-consecutive `User`s within each `Conversation`. – wjandrea Dec 31 '22 at 02:10
  • is `df.groupby(['Conversation','User'])['Text'].agg(' '.join).reset_index()` what you are looking for? – rhug123 Dec 31 '22 at 05:15

1 Answers1

1

I've changed the example to include the conditions that were specified. One way to do this is to introduce a temporary boundary column, and then group by the boundary, conversation and user. If so desired the boundary column can be removed.

import pandas as pd

data = [
    [1, "User1", "Hello."], 
    [1, "User2", "How are you?"], 
    [1, "User1", "I'm fine."], 
    [2, "User1", "Nice to meet you."], 
    [2, "User2", "Hello."], 
    [2, "User2", "I'm happy."], 
    [2, "User2", "Goodbye."], 
    [1, "User1", "Bye now."]]
df = pd.DataFrame(data, columns=['Conversation', 'User', 'Text'])


df['Boundary'] = ((df['Conversation'].astype(str) + df['User']) != (df['Conversation'].astype(str) + df['User']).shift()).cumsum()
print(df.groupby(['Boundary','Conversation', 'User'], as_index=False, sort=True).agg(' '.join).drop(columns=['Boundary']))

Output:

   Conversation   User                        Text
0             1  User1                      Hello.
1             1  User2                How are you?
2             1  User1                   I'm fine.
3             2  User1           Nice to meet you.
4             2  User2  Hello. I'm happy. Goodbye.
5             1  User1                    Bye now.
bn_ln
  • 1,648
  • 1
  • 6
  • 13