
For example, I have two lambda functions to apply to a grouped data frame:

df.groupby(['A', 'B']).apply(lambda g: ...)
df.groupby(['A', 'B']).apply(lambda g: ...)

Each works on its own, but not when I pass both in a list:

df.groupby(['A', 'B']).apply([lambda g: ..., lambda g: ...])

Why is that? How can I apply different functions to a grouped object and get the results concatenated column-wise?

Is there a way to apply a function without specifying a particular column? Everything suggested so far seems to work only on specific columns.

James Wong
  • related and probable dupe: https://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns, is this what you're after? – EdChum Jun 02 '17 at 15:04
  • see the `agg` function. `df.groupby(['field1']).agg({'field2':'mean','field3':'count'})` – flyingmeatball Jun 02 '17 at 15:04
  • Possible duplicate of [Apply multiple functions to multiple groupby columns](https://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns) – flyingmeatball Jun 02 '17 at 15:05
  • I don't need to apply different functions to different columns. I want the two functions applied on the whole grouped data frame. Am I missing something? – James Wong Jun 02 '17 at 15:34
  • `groups = df.groupby(...); result = groups.apply(...).join(groups.apply(...))` – Paul H Jun 02 '17 at 15:52
  • Initially I thought `join` would work, but it does not and gives me `AttributeError: 'Series' object has no attribute 'join'`. That's where it gets super unintuitive and weird to me. Why doesn't Series have `join` like DataFrame does? – James Wong Jun 03 '17 at 01:11
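
Following up on the comment thread above: `apply` expects a single callable, so one way to get several whole-group results side by side is to return a `pd.Series` from one function, or to glue separate results together with `pd.concat` (which accepts Series, unlike `join`). This is only a sketch; the data and the helper names `total` and `spread` are hypothetical, not from the question.

```python
import pandas as pd

# Hypothetical data; the question does not show its frame.
df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B": ["x", "x", "y", "y"],
    "C": [0, 1, 2, 3],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Hypothetical whole-group functions: each receives the group's DataFrame
# and may combine several columns.
def total(g):
    return (g["C"] * g["D"]).sum()

def spread(g):
    return g["D"].max() - g["C"].min()

groups = df.groupby(["A", "B"])

# Option 1: a single apply whose function returns a labelled Series;
# the Series labels become the column headers of the result.
combined = groups.apply(
    lambda g: pd.Series({"total": total(g), "spread": spread(g)})
)

# Option 2: apply each function separately and concatenate column-wise.
# pd.concat works on Series; `keys` supplies the column names.
side_by_side = pd.concat(
    [groups.apply(total), groups.apply(spread)],
    axis=1,
    keys=["total", "spread"],
)

print(combined)
print(side_by_side)
```

Depending on the pandas version, `apply` on the whole group may warn about operating on the grouping columns; the functions here only touch `C` and `D`, so the values should not be affected.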

2 Answers


This is a good opportunity to highlight one of the changes in pandas 0.20:

Deprecate groupby.agg() with a dictionary when renaming

What does this mean?
Consider the dataframe df

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
        A=np.tile([1, 2], 2).repeat(2),
        B=np.repeat([1, 2], 2).repeat(2),
        C=np.arange(8)
    ))
df

   A  B  C
0  1  1  0
1  1  1  1
2  2  1  2
3  2  1  3
4  1  2  4
5  1  2  5
6  2  2  6
7  2  2  7

We could previously do

df.groupby(['A', 'B']).C.agg(dict(f1=lambda x: x.size, f2=lambda x: x.max()))

     f1  f2
A B        
1 1   2   1
  2   2   5
2 1   2   3
  2   2   7

And our names 'f1' and 'f2' were placed as column headers. However, with pandas 0.20 I get this warning:

//anaconda/envs/3.6/lib/python3.6/site-packages/ipykernel/__main__.py:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
  if __name__ == '__main__':

So what does that mean? What if I do two lambdas without the naming dictionary?

df.groupby(['A', 'B']).C.agg([lambda x: x.size, lambda x: x.max()])

---------------------------------------------------------------------------
SpecificationError                        Traceback (most recent call last)
<ipython-input-398-fc26cf466812> in <module>()
----> 1 print(df.groupby(['A', 'B']).C.agg([lambda x: x.size, lambda x: x.max()]))

//anaconda/envs/3.6/lib/python3.6/site-packages/pandas/core/groupby.py in aggregate(self, func_or_funcs, *args, **kwargs)
   2798         if hasattr(func_or_funcs, '__iter__'):
   2799             ret = self._aggregate_multiple_funcs(func_or_funcs,
-> 2800                                                  (_level or 0) + 1)
   2801         else:
   2802             cyfunc = self._is_cython_func(func_or_funcs)

//anaconda/envs/3.6/lib/python3.6/site-packages/pandas/core/groupby.py in _aggregate_multiple_funcs(self, arg, _level)
   2863             if name in results:
   2864                 raise SpecificationError('Function names must be unique, '
-> 2865                                          'found multiple named %s' % name)
   2866 
   2867             # reset the cache so that we

SpecificationError: Function names must be unique, found multiple named <lambda>

pandas raises an error because both result columns would be named '<lambda>'.
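
To see where the name '<lambda>' comes from: every lambda carries the same `__name__`, while a function created with `def` carries its own. Here is a minimal sketch in plain Python (the variable names are only illustrative). Going by the error message above, pandas takes the column labels from these names:

```python
# Every lambda shares the same __name__, so two of them collide as labels.
size_lambda = lambda x: x.size
max_lambda = lambda x: x.max()
print(size_lambda.__name__, max_lambda.__name__)

# A function created with def carries its own name.
def f1(x):
    return x.size

print(f1.__name__)
```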

Solution: Name your functions

def f1(x):
    return x.size

def f2(x):
    return x.max()

df.groupby(['A', 'B']).C.agg([f1, f2])

     f1  f2
A B        
1 1   2   1
  2   2   5
2 1   2   3
  2   2   7
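
If you are on pandas 0.25 or later (an assumption; this answer was written against 0.20), named aggregation should give the same result while still letting you use lambdas. A sketch, rebuilding the df from above so it runs on its own:

```python
import numpy as np
import pandas as pd

# Same frame as in the answer above.
df = pd.DataFrame(dict(
    A=np.tile([1, 2], 2).repeat(2),
    B=np.repeat([1, 2], 2).repeat(2),
    C=np.arange(8),
))

# Named aggregation: each keyword becomes a column header,
# so the lambdas no longer need unique names of their own.
result = df.groupby(["A", "B"])["C"].agg(
    f1=lambda x: x.size,
    f2=lambda x: x.max(),
)
print(result)
```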
piRSquared
  • Great input! But when I explicitly named each function I got multiple errors. Like `TypeError: an integer is required` and `KeyError: 'o'`. I have no idea why that is. – James Wong Jun 03 '17 at 01:04

Why don't you use agg?

df.groupby(['A', 'B']).agg(lambda g: ...)

This might be new behaviour introduced since you posted your question.

louisD