Concatenate strings from several rows using Pandas groupby

Question

I want to merge several strings in a dataframe based on a groupby in Pandas.

This is my code so far:

import pandas as pd
from io import StringIO

data = StringIO("""
"name1","hej","2014-11-01"
"name1","du","2014-11-02"
"name1","aj","2014-12-01"
"name1","oj","2014-12-02"
"name2","fin","2014-11-01"
"name2","katt","2014-11-02"
"name2","mycket","2014-12-01"
"name2","lite","2014-12-01"
""")

# load string as stream into dataframe
df = pd.read_csv(data,header=0, names=["name","text","date"],parse_dates=[2])

# add column with month
df["month"] = df["date"].apply(lambda x: x.month)

I want the end result to look like this:

[image of the desired result, not reproduced here]

I don't get how I can use groupby and apply some sort of concatenation of the strings in the column "text". Any help appreciated!

EdChum · Accepted Answer · 2017-11-10T09:48:27.393

337

You can group by the 'name' and 'month' columns, then call transform, which returns data aligned to the original df, and apply a lambda that joins the text entries:

In [119]:

df['text'] = df[['name','text','month']].groupby(['name','month'])['text'].transform(lambda x: ','.join(x))
df[['name','text','month']].drop_duplicates()
Out[119]:
    name         text  month
0  name1       hej,du     11
2  name1        aj,oj     12
4  name2     fin,katt     11
6  name2  mycket,lite     12

I subset the original df by passing a list of the columns of interest, df[['name','text','month']], and then call drop_duplicates.
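To make the transform step more concrete, here is a minimal sketch. It assumes the eight rows from the question have been typed in directly (no CSV parsing), and it writes the joined strings to a new column, text_joined, a name introduced only for this illustration, so the original text values are not overwritten.

```python
import pandas as pd

# The question's rows, entered by hand for this sketch
df = pd.DataFrame({
    "name": ["name1"] * 4 + ["name2"] * 4,
    "text": ["hej", "du", "aj", "oj", "fin", "katt", "mycket", "lite"],
    "month": [11, 11, 12, 12, 11, 11, 12, 12],
})

# transform returns one value per original row, aligned to df's index,
# so every row in a group receives the same joined string
joined = df.groupby(["name", "month"])["text"].transform(",".join)

# text_joined is a hypothetical column name; drop_duplicates then keeps
# the first row of each group
out = df.assign(text_joined=joined)[["name", "month", "text_joined"]].drop_duplicates()
print(out)
```

Because transform keeps the original index, the surviving rows carry the index labels of the first row in each group, which matches the 0, 2, 4, 6 seen in the output above.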

EDIT: actually, I can just call apply and then reset_index:

In [124]:

df.groupby(['name','month'])['text'].apply(lambda x: ','.join(x)).reset_index()

Out[124]:
    name  month         text
0  name1     11       hej,du
1  name1     12        aj,oj
2  name2     11     fin,katt
3  name2     12  mycket,lite

update

the lambda is unnecessary here:

In[38]:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()

Out[38]: 
    name  month         text
0  name1     11           du
1  name1     12        aj,oj
2  name2     11     fin,katt
3  name2     12  mycket,lite
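A possible explanation for the first row showing only 'du' here: the read_csv call in the question passes header=0 together with names, and in that combination pandas treats the first non-blank line as a header to be replaced, so the "hej" row is not loaded as data. The sketch below uses a shortened, three-line version of the question's CSV to show the difference; whether this is what happened in the session above is an assumption.

```python
import pandas as pd
from io import StringIO

# Shortened copy of the question's data (three lines only, for illustration)
csv_text = '''
"name1","hej","2014-11-01"
"name1","du","2014-11-02"
"name2","fin","2014-11-01"
'''
cols = ["name", "text", "date"]

# header=0 plus names: the first non-blank line is consumed as the header
with_header0 = pd.read_csv(StringIO(csv_text), header=0, names=cols)

# header=None plus names: every line is kept as data
with_header_none = pd.read_csv(StringIO(csv_text), header=None, names=cols)

print(len(with_header0), len(with_header_none))
print(with_header0.groupby("name")["text"].apply(",".join))
print(with_header_none.groupby("name")["text"].apply(",".join))
```

If the first row is meant to be data, header=None is probably the safer choice for a file without a header line.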

edited Nov 10 '17 at 09:48

answered Dec 04 '14 at 15:54


In `pandas < 1.0`, `.drop_duplicates()` ignores the index, which may give unexpected results. You can avoid this by using `.agg(lambda x: ','.join(x))` instead of `.transform().drop_duplicates()`. – Matthias Fripp May 30 '20 at 02:41
Neat and uncomplicated. Eminently flexible also – Raghavan vmvs Sep 08 '20 at 08:53
`drop_duplicates()` might not work if you do not include parameter `drop_duplicates(inplace=True)` or just rewrite the line of code as `df = df[['name','text','month']].drop_duplicates()` – IAmBotmaker Sep 23 '20 at 11:46
What ensures that the text e.g. in the first column is actually "hej du" and not "du hej"? Is there an implicit sort somewhere? How can I make this explicit, e.g. sort by the date column? – Thomas Aug 04 '21 at 13:55
Why did 'hej,du' change to just 'du' in the "update" section? – constantstranger Mar 19 '22 at 23:11
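On the ordering question raised in the comments above: as far as the pandas documentation states, groupby preserves the order of rows within each group, so the joined string follows the row order of the frame at the time of grouping. One way to make that explicit is to sort before grouping. The sketch below assumes a small, deliberately shuffled sample of the question's data.

```python
import pandas as pd

# Deliberately out-of-order sample rows (illustrative only)
df = pd.DataFrame({
    "name": ["name1", "name1", "name1"],
    "text": ["du", "oj", "hej"],
    "date": pd.to_datetime(["2014-11-02", "2014-12-02", "2014-11-01"]),
})
df["month"] = df["date"].dt.month

# Sort by date first; rows then reach the join in chronological order
result = (
    df.sort_values("date")
      .groupby(["name", "month"])["text"]
      .agg(",".join)
      .reset_index()
)
print(result)
```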

score 134 · Answer 2 · edited Feb 14 '22 at 19:29

We can group by the 'name' and 'month' columns, then call the agg() method of the resulting pandas GroupBy object.

agg() allows one or more aggregations to be computed per group in a single call.

df.groupby(['name', 'month'], as_index = False).agg({'text': ' '.join})
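For reference, a runnable version of the line above might look like the following. It assumes the question's rows typed in as a DataFrame; with as_index=False the group keys come back as ordinary columns, so no reset_index is needed.

```python
import pandas as pd

# The question's rows, entered by hand for this sketch
df = pd.DataFrame({
    "name": ["name1"] * 4 + ["name2"] * 4,
    "text": ["hej", "du", "aj", "oj", "fin", "katt", "mycket", "lite"],
    "month": [11, 11, 12, 12, 11, 11, 12, 12],
})

# The dict maps each column to the aggregation applied to it
result = df.groupby(["name", "month"], as_index=False).agg({"text": " ".join})
print(result)
```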

edited Feb 14 '22 at 19:29

David Wolf


answered Dec 11 '19 at 10:48

Ram Prajapati


hi, any ideas for dropping duplicates with agg function ? – kağan hazal koçdemir Sep 14 '21 at 19:40
@kağanhazalkoçdemir `agg({'text': lambda x: ' '.join(set(x))})` – Nicolas78 Sep 28 '21 at 08:16
How can one use this method in a case where NULLs are allowed in the column 'text' ? – Andew Jul 21 '22 at 16:42
`f = lambda x: func(x, *args, **kwargs) TypeError: sequence item 45: expected str instance, NoneType found` on NULL or None values in the database – Andew Jul 21 '22 at 16:43
This also allows you to keep additional columns, for example by adding `, 'othercol': 'last'` into the `agg` dict – fantabolous Sep 13 '22 at 05:53
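The comments above touch on three variations: removing duplicates, coping with NULL/None values, and keeping an extra column. A minimal sketch that combines them is shown below. The helper join_unique and the sample column othercol are names made up for this illustration; dict.fromkeys is used instead of set so that first-seen order is kept, which is a design choice rather than anything the answer prescribes.

```python
import pandas as pd

# Sample data with a missing value and a repeated string (illustrative only)
df = pd.DataFrame({
    "name": ["name1", "name1", "name1", "name2"],
    "month": [11, 11, 11, 11],
    "text": ["hej", None, "hej", "fin"],
    "othercol": [1, 2, 3, 4],  # hypothetical extra column
})

def join_unique(values):
    """Join the non-missing strings of one group, dropping repeats."""
    present = values.dropna()                 # avoids the TypeError on None
    return " ".join(dict.fromkeys(present))   # de-duplicate, keep order

result = df.groupby(["name", "month"], as_index=False).agg(
    {"text": join_unique, "othercol": "last"}
)
print(result)
```

Calling fillna("") before grouping is another option, though it would leave empty strings in the joined text.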

score 60 · Answer 3 · edited Feb 15 '22 at 02:07

The answer by EdChum provides you with a lot of flexibility, but if you just want to concatenate strings into a column of list objects you can also do:

output_series = df.groupby(['name','month'])['text'].apply(list)

edited Feb 15 '22 at 02:07

David Wolf


answered Aug 28 '17 at 19:18

Rutger Hofste


Man, you've just saved me a lot of time. Thank you. This is the best way to assemble the chronological lists of registrations/user ids into 'cohorts' that I know of. Thank you once again. – Alex Fedotov Jun 28 '20 at 02:37
This solution worked for me very well for getting the unique appearances too. I just used “set” instead of “list” and then daisy chained a join and presto. Note that it doesn’t work if there are nan values, so I had to use fillna() on the text field first. In my case the command ended: df.groupby(['doc_id'])['author'].apply(set).apply(", ".join).reset_index() – whydoesntwork Apr 11 '22 at 12:52
I don't think this adds spaces between the strings does it? – Bill Apr 12 '22 at 15:44
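Regarding the last comment: apply(list) produces Python list objects, not strings, so there are no separators at that stage. If a single string is wanted, a join can be chained as a second step, as one of the comments above describes. A small sketch, assuming the question's rows typed in directly:

```python
import pandas as pd

# The question's rows, entered by hand for this sketch
df = pd.DataFrame({
    "name": ["name1"] * 4 + ["name2"] * 4,
    "text": ["hej", "du", "aj", "oj", "fin", "katt", "mycket", "lite"],
    "month": [11, 11, 12, 12, 11, 11, 12, 12],
})

# Step 1: one Python list per (name, month) group
lists = df.groupby(["name", "month"])["text"].apply(list)
print(lists.loc[("name1", 11)])

# Step 2: turn each list into a string with an explicit separator
joined = lists.apply(", ".join).reset_index()
print(joined)
```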

score 17 · Answer 4 · edited Nov 25 '20 at 21:19

If you want to collect your "text" values into a list:

df.groupby(['name', 'month'], as_index = False).agg({'text': list})

edited Nov 25 '20 at 21:19

theWellHopeErr


answered Nov 25 '20 at 14:46

Ismail


Nic Scozzaro · Answer 5 · 2021-09-08T16:52:18.277

13

For me the above solutions were close but added some unwanted \n's and dtype: object, so here's a modified version:

df.groupby(['name', 'month'])['text'].apply(lambda text: ''.join(text.to_string(index=False))).str.replace('\n', '', regex=False).reset_index()

edited Sep 08 '21 at 16:52

answered Jun 28 '18 at 15:00

Nic Scozzaro


score 6 · Answer 6 · answered Oct 28 '21 at 10:17

Please try this line of code:

df.groupby(['name','month'])['text'].apply(','.join).reset_index()

answered Oct 28 '21 at 10:17

Ashish Anand


score 3 · Answer 7 · answered Mar 30 '21 at 10:12

Although this is an old question, just in case: I used the code below and it seems to work like a charm.

text = ''.join(df[df['date'].dt.month==8]['text'])

answered Mar 30 '21 at 10:12

MMSA


score 1 · Answer 8 · answered Dec 01 '22 at 11:55

Thanks to all the other answers, the following is probably the most concise and feels more natural. Using df.groupby("X")["A"].agg() aggregates over one or many selected columns.

import pandas

df = pandas.DataFrame({'A' : ['a', 'a', 'b', 'c', 'c'],
                       'B' : ['i', 'j', 'k', 'i', 'j'],
                       'X' : [1, 2, 2, 1, 3]})

  A  B  X
  a  i  1
  a  j  2
  b  k  2
  c  i  1
  c  j  3

df.groupby("X", as_index=False)["A"].agg(' '.join)

  X    A
  1  a c
  2  a b
  3    c

df.groupby("X", as_index=False)[["A", "B"]].agg(' '.join)

  X    A    B
  1  a c  i i
  2  a b  j k
  3    c    j
