Pandas groupby: How to get a union of strings

Question

I have a dataframe like this:

   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

Calling

In [10]: print df.groupby("A")["B"].sum()

will return

A
1    1.615586
2    0.421821
3    0.463468
4    0.643961

Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.

A
1    {This, string}
2    {is, !}
3    {a}
4    {random}

I have been trying to find ways to do this.

Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although

df.groupby("A")["B"]

is a

pandas.core.groupby.SeriesGroupBy object

so I was hoping any Series method would work. Any ideas?

score 199 · Accepted Answer · edited Sep 10 '19 at 07:44

In [4]: df = read_csv(StringIO(data),sep='\s+')

In [5]: df
Out[5]: 
   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

In [6]: df.dtypes
Out[6]: 
A      int64
B    float64
C     object
dtype: object

When you apply your own function, there is not automatic exclusions of non-numeric columns. This is slower, though, than the application of .sum() to the groupby

In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]: 
   A         B           C
A                         
1  2  1.615586  Thisstring
2  4  0.421821         is!
3  3  0.463468           a
4  4  0.643961      random

sum by default concatenates

In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]: 
A
1    Thisstring
2           is!
3             a
4        random
dtype: object

You can do pretty much what you want

In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]: 
A
1    {This, string}
2           {is, !}
3               {a}
4          {random}
dtype: object

Doing this on a whole frame, one group at a time. Key is to return a Series

def f(x):
     return Series(dict(A = x['A'].sum(), 
                        B = x['B'].sum(), 
                        C = "{%s}" % ', '.join(x['C'])))

In [14]: df.groupby('A').apply(f)
Out[14]: 
   A         B               C
A                             
1  2  1.615586  {This, string}
2  4  0.421821         {is, !}
3  3  0.463468             {a}
4  4  0.643961        {random}

It seems these operations are now vectorised removing the need for `apply` and `lambda`s. I came here wondering why `pandas` actually concats and not return an error on summing strings. — NelsonGon, Sep 10 '19 at 07:42
If you are trying to concat strings and add a character in between, the .agg solution recommended by @voithos below is much faster than the .apply recommended here. In my testing I was getting 5-10x faster. — Doubledown, Sep 12 '19 at 00:02

score 83 · Answer 2 · answered Jul 24 '13 at 17:51

83

You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.

>>> d
   A       B
0  1    This
1  2      is
2  3       a
3  4  random
4  1  string
5  2       !
>>> d.groupby('A')['B'].apply(list)
A
1    [This, string]
2           [is, !]
3               [a]
4          [random]
dtype: object

If you want something else, just write a function that does what you want and then apply that.

answered Jul 24 '13 at 17:51

BrenBarn

242,874
37
412
384

Working fine, but Column A is missing. – Vineesh TP Oct 08 '19 at 21:16
6

@VineeshTP: Column A was used as the grouping column, so it's in the index, as you can see in the example. You could get it back out as a column by using `.reset_index()`. – BrenBarn Oct 09 '19 at 03:57

score 36 · Answer 3 · answered Jul 24 '13 at 17:52

36

You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)

df.groupby('A')['B'].agg(lambda col: ''.join(col))

answered Jul 24 '13 at 17:52

voithos

68,482
12
101
116

1

It really works. Amazing. As @voithos mentioned "untested", I was not very optimistic. Bit I tested his version as an entry in an agg dictionary and it worked as intended: .agg({'tp': 'sum', 'BaseWgt': 'max','TP_short':lambda col: ', '.join(col)}) Made my day – matthhias Jul 06 '18 at 16:36
4

If you are trying to concat strings together with some type of separator, I've found this .agg suggestion to be much faster than .apply. For a dataset of 600k+ text strings, I got identical results 5-10x faster. – Doubledown Sep 12 '19 at 00:00

score 17 · Answer 4 · edited Feb 21 '19 at 15:33

17

You could try this:

df.groupby('A').agg({'B':'sum','C':'-'.join})

edited Feb 21 '19 at 15:33

luis.galdo

543
5
20

answered Oct 15 '18 at 12:35

user3241146

171
1
2

2

From review: could you please add more explanation to your answer? – toti08 Oct 15 '18 at 12:41
2

Groupby is applied on column 'A' and with agg function i could use different functions on different columns say sum the elements in column 'C' , concatenate the elements in column 'C' while inserting a ' - ' between the words – user3241146 Oct 16 '18 at 07:11

score 16 · Answer 5 · answered Aug 23 '19 at 10:06

Named aggregations with `pandas >= 0.25.0`

Since pandas version 0.25.0 we have named aggregations where we can groupby, aggregate and at the same time assign new names to our columns. This way we won't get the MultiIndex columns, and the column names make more sense given the data they contain:

aggregate and get a list of strings

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', list)).reset_index()

print(grp)
   A     B_sum               C
0  1  1.615586  [This, string]
1  2  0.421821         [is, !]
2  3  0.463468             [a]
3  4  0.643961        [random]

aggregate and join the strings

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', ', '.join)).reset_index()

print(grp)
   A     B_sum             C
0  1  1.615586  This, string
1  2  0.421821         is, !
2  3  0.463468             a
3  4  0.643961        random

UserYmY · Answer 6 · 2016-10-03T13:11:42.593

10

a simple solution would be :

>>> df.groupby(['A','B']).c.unique().reset_index()

edited Oct 03 '16 at 13:11

answered Jul 29 '16 at 13:21

UserYmY

8,034
17
57
71

1

If in case someone is interested in joining the contents of the list into a string `df.groupby(['A','B']).c.unique().apply(lambda x: ';'.join(x)).reset_index()` – Vivek-Ananth Aug 29 '18 at 14:56

score 6 · Answer 7 · answered Oct 26 '18 at 10:09

6

If you'd like to overwrite column B in the dataframe, this should work:

    df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))

answered Oct 26 '18 at 10:09

Amit

307
3
6

score 3 · Answer 8 · answered Oct 09 '19 at 13:04

3

Following @Erfan's good answer, most of the times in an analysis of aggregate values you want the unique possible combinations of these existing character values:

unique_chars = lambda x: ', '.join(x.unique())
(df
 .groupby(['A'])
 .agg({'C': unique_chars}))

answered Oct 09 '19 at 13:04

Paul Rougieux

10,289
4
68
110

To confirm my understanding, `.unique()` used within lambda is determining that the Series to contain all unique values is within the specified group by values? If so, is pandas temporarily storing each of the unique values (per group by value) somewhere outside of the `unique_chars` variable to determine what values are in fact unique before ultimately assigning the values to the variable? Maybe worth a new SO question. – deesolie Mar 05 '21 at 18:14
@deesolie `unique_chars` is a function of the vector x. Lambda is a short name for creating a one line function in python. The function is then used with each column vector as an input. – Paul Rougieux Mar 08 '21 at 16:18
I understand lambda functions. Let's say we had `df.groupby(['A'])['B'].transform(lambda x: ','.join(x.unique()))`, I am curious as to how pandas is temporarily storing each of the values in the group by series to check if the proceeding value is already in the joined string or not. Again, this may be a larger question and thus should be asked as a new SO question. – deesolie Mar 09 '21 at 17:13
1

@deesolie No pandas doesn't check if the value is already in the joined string. You got the order wrong. This is a vectorized operation. First, unique returns a vector of unique characters. Second, `.join` joins all the elements of the vector together. – Paul Rougieux Mar 13 '21 at 00:08
Thanks Paul, looking at the order of operations in that way is much easier to understand how this line is functioning. – deesolie Mar 14 '21 at 01:23

Pandas groupby: How to get a union of strings

8 Answers8

Named aggregations with `pandas >= 0.25.0`

Linked

Related

Pandas groupby: How to get a union of strings

8 Answers8

Named aggregations with pandas >= 0.25.0

Linked

Related

Named aggregations with `pandas >= 0.25.0`