4

I'm using Python3 with pandas version '0.19.2'.

I have a pandas df as follows:

chat_id    line
1          'Hi.'
1          'Hi, how are you?.'
1          'I'm well, thanks.'
2          'Is it going to rain?.'
2          'No, I don't think so.'

I want to group by 'chat_id', then do something like a rolling sum on 'line' to get the following:

chat_id    line                     conversation
1          'Hi.'                    'Hi.'
1          'Hi, how are you?.'      'Hi. Hi, how are you?.'
1          'I'm well, thanks.'      'Hi. Hi, how are you?. I'm well, thanks.'
2          'Is it going to rain?.'  'Is it going to rain?.'
2          'No, I don't think so.'  'Is it going to rain?. No, I don't think so.'

I believe df.groupby('chat_id')['line'].cumsum() would only work on a numeric column.

I have also tried df.groupby(by=['chat_id'], as_index=False)['line'].apply(list) to get a list of all the lines in the full conversation, but then I can't figure out how to unpack that list to create the 'rolling sum' style conversation column.

user3591836
  • 953
  • 2
  • 16
  • 29
  • Interesting. `cumsum` works if you call it on a Series but raises an error when called on a groupby object. – ayhan Apr 23 '17 at 08:53

1 Answers1

3

For me works apply with Series.cumsum, if need separator add space:

df['new'] = df.groupby('chat_id')['line'].apply(lambda x: (x + ' ').cumsum().str.strip())
print (df)
   chat_id                   line                                          new
0        1                    Hi.                                          Hi.
1        1      Hi, how are you?.                        Hi. Hi, how are you?.
2        1      I'm well, thanks.      Hi. Hi, how are you?. I'm well, thanks.
3        2  Is it going to rain?.                        Is it going to rain?.
4        2  No, I don't think so.  Is it going to rain?. No, I don't think so.

df['line'] = df['line'].str.strip("'")
df['new'] = df.groupby('chat_id')['line'].apply(lambda x: "'" + (x + ' ').cumsum().str.strip() + "'")
print (df)
   chat_id                   line  \
0        1                    Hi.   
1        1      Hi, how are you?.   
2        1      I'm well, thanks.   
3        2  Is it going to rain?.   
4        2  No, I don't think so.   

                                             new  
0                                          'Hi.'  
1                        'Hi. Hi, how are you?.'  
2      'Hi. Hi, how are you?. I'm well, thanks.'  
3                        'Is it going to rain?.'  
4  'Is it going to rain?. No, I don't think so.' 
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • For me that results in: ValueError: cannot reindex from a duplicate axis – user3591836 Apr 23 '17 at 10:29
  • What is your version of pandas? `print (pd.show_versions())`. Because I cannot simulate your error. I test duplicates in values, duplicates in index and all works perfectly in version `0.19.2`. – jezrael Apr 23 '17 at 12:21
  • Sorry, you're right. I had to reset_index() on the df and then it worked. – user3591836 Apr 23 '17 at 23:30
  • If I have a `NaN` values in between the conversation (for example at `index 1`), how can I exclude that from `cumsum`-ing it? Thanks! – Toto Lele Jan 27 '21 at 09:08
  • 1
    @TotoLele - One idea `df['new'] = df.dropna(subset=['line']).groupby('chat_id')['line'].apply(lambda x: (x + ' ').cumsum().str.strip())` – jezrael Jan 27 '21 at 09:09