140

I have a dataframe like this:

   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

Calling

In [10]: print df.groupby("A")["B"].sum()

will return

A
1    1.615586
2    0.421821
3    0.463468
4    0.643961

Now I would like to do "the same" for column "C". Because that column contains strings, sum() doesn't work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.

A
1    {This, string}
2    {is, !}
3    {a}
4    {random}

I have been trying to find ways to do this.

Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn't work, although

df.groupby("A")["B"]

is a

pandas.core.groupby.SeriesGroupBy object

so I was hoping any Series method would work. Any ideas?

cs95
  • 379,657
  • 97
  • 704
  • 746
Anne
  • 6,752
  • 8
  • 33
  • 50

8 Answers8

199
In [4]: df = read_csv(StringIO(data),sep='\s+')

In [5]: df
Out[5]: 
   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

In [6]: df.dtypes
Out[6]: 
A      int64
B    float64
C     object
dtype: object

When you apply your own function, there is not automatic exclusions of non-numeric columns. This is slower, though, than the application of .sum() to the groupby

In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]: 
   A         B           C
A                         
1  2  1.615586  Thisstring
2  4  0.421821         is!
3  3  0.463468           a
4  4  0.643961      random

sum by default concatenates

In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]: 
A
1    Thisstring
2           is!
3             a
4        random
dtype: object

You can do pretty much what you want

In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]: 
A
1    {This, string}
2           {is, !}
3               {a}
4          {random}
dtype: object

Doing this on a whole frame, one group at a time. Key is to return a Series

def f(x):
     return Series(dict(A = x['A'].sum(), 
                        B = x['B'].sum(), 
                        C = "{%s}" % ', '.join(x['C'])))

In [14]: df.groupby('A').apply(f)
Out[14]: 
   A         B               C
A                             
1  2  1.615586  {This, string}
2  4  0.421821         {is, !}
3  3  0.463468             {a}
4  4  0.643961        {random}
NelsonGon
  • 13,015
  • 7
  • 27
  • 57
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • 2
    It seems these operations are now vectorised removing the need for `apply` and `lambda`s. I came here wondering why `pandas` actually concats and not return an error on summing strings. – NelsonGon Sep 10 '19 at 07:42
  • 1
    If you are trying to concat strings and add a character in between, the .agg solution recommended by @voithos below is much faster than the .apply recommended here. In my testing I was getting 5-10x faster. – Doubledown Sep 12 '19 at 00:02
83

You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.

>>> d
   A       B
0  1    This
1  2      is
2  3       a
3  4  random
4  1  string
5  2       !
>>> d.groupby('A')['B'].apply(list)
A
1    [This, string]
2           [is, !]
3               [a]
4          [random]
dtype: object

If you want something else, just write a function that does what you want and then apply that.

BrenBarn
  • 242,874
  • 37
  • 412
  • 384
  • Working fine, but Column A is missing. – Vineesh TP Oct 08 '19 at 21:16
  • 6
    @VineeshTP: Column A was used as the grouping column, so it's in the index, as you can see in the example. You could get it back out as a column by using `.reset_index()`. – BrenBarn Oct 09 '19 at 03:57
36

You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)

df.groupby('A')['B'].agg(lambda col: ''.join(col))
voithos
  • 68,482
  • 12
  • 101
  • 116
  • 1
    It really works. Amazing. As @voithos mentioned "untested", I was not very optimistic. Bit I tested his version as an entry in an agg dictionary and it worked as intended: .agg({'tp': 'sum', 'BaseWgt': 'max','TP_short':lambda col: ', '.join(col)}) Made my day – matthhias Jul 06 '18 at 16:36
  • 4
    If you are trying to concat strings together with some type of separator, I've found this .agg suggestion to be much faster than .apply. For a dataset of 600k+ text strings, I got identical results 5-10x faster. – Doubledown Sep 12 '19 at 00:00
17

You could try this:

df.groupby('A').agg({'B':'sum','C':'-'.join})
luis.galdo
  • 543
  • 5
  • 20
user3241146
  • 171
  • 1
  • 2
  • 2
    From review: could you please add more explanation to your answer? – toti08 Oct 15 '18 at 12:41
  • 2
    Groupby is applied on column 'A' and with agg function i could use different functions on different columns say sum the elements in column 'C' , concatenate the elements in column 'C' while inserting a ' - ' between the words – user3241146 Oct 16 '18 at 07:11
16

Named aggregations with pandas >= 0.25.0

Since pandas version 0.25.0 we have named aggregations where we can groupby, aggregate and at the same time assign new names to our columns. This way we won't get the MultiIndex columns, and the column names make more sense given the data they contain:


aggregate and get a list of strings

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', list)).reset_index()

print(grp)
   A     B_sum               C
0  1  1.615586  [This, string]
1  2  0.421821         [is, !]
2  3  0.463468             [a]
3  4  0.643961        [random]

aggregate and join the strings

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', ', '.join)).reset_index()

print(grp)
   A     B_sum             C
0  1  1.615586  This, string
1  2  0.421821         is, !
2  3  0.463468             a
3  4  0.643961        random
Erfan
  • 40,971
  • 8
  • 66
  • 78
10

a simple solution would be :

>>> df.groupby(['A','B']).c.unique().reset_index()
UserYmY
  • 8,034
  • 17
  • 57
  • 71
  • 1
    If in case someone is interested in joining the contents of the list into a string `df.groupby(['A','B']).c.unique().apply(lambda x: ';'.join(x)).reset_index()` – Vivek-Ananth Aug 29 '18 at 14:56
6

If you'd like to overwrite column B in the dataframe, this should work:

    df = df.groupby('A',as_index=False).agg(lambda x:'\n'.join(x))
Amit
  • 307
  • 3
  • 6
3

Following @Erfan's good answer, most of the times in an analysis of aggregate values you want the unique possible combinations of these existing character values:

unique_chars = lambda x: ', '.join(x.unique())
(df
 .groupby(['A'])
 .agg({'C': unique_chars}))
Paul Rougieux
  • 10,289
  • 4
  • 68
  • 110
  • To confirm my understanding, `.unique()` used within lambda is determining that the Series to contain all unique values is within the specified group by values? If so, is pandas temporarily storing each of the unique values (per group by value) somewhere outside of the `unique_chars` variable to determine what values are in fact unique before ultimately assigning the values to the variable? Maybe worth a new SO question. – deesolie Mar 05 '21 at 18:14
  • @deesolie `unique_chars` is a function of the vector x. Lambda is a short name for creating a one line function in python. The function is then used with each column vector as an input. – Paul Rougieux Mar 08 '21 at 16:18
  • I understand lambda functions. Let's say we had `df.groupby(['A'])['B'].transform(lambda x: ','.join(x.unique()))`, I am curious as to how pandas is temporarily storing each of the values in the group by series to check if the proceeding value is already in the joined string or not. Again, this may be a larger question and thus should be asked as a new SO question. – deesolie Mar 09 '21 at 17:13
  • 1
    @deesolie No pandas doesn't check if the value is already in the joined string. You got the order wrong. This is a vectorized operation. First, unique returns a vector of unique characters. Second, `.join` joins all the elements of the vector together. – Paul Rougieux Mar 13 '21 at 00:08
  • Thanks Paul, looking at the order of operations in that way is much easier to understand how this line is functioning. – deesolie Mar 14 '21 at 01:23