4

How do I specify custom aggregating functions so that they behave correctly when used in list arguments of pandas.DataFrame.aggregate?

Given a two-column dataframe in pandas ...

import pandas as pd
import numpy as np
df = pd.DataFrame(index=range(10))
df['a'] = [ 3 * x for x in range(10) ]
df['b'] = [ 1 -2 * x for x in range(10) ]

... aggregating over a list of aggregation function specs is not a problem:

def ok_mean(x):
  return x.mean()

df.aggregate(['mean', np.max, ok_mean])
            a    b
mean     13.5 -8.0
amax     27.0  1.0
ok_mean  13.5 -8.0

But when the function (lambda or named) calls np.mean(x) instead of x.mean(), nothing is aggregated and the original values are returned:

def nok_mean(x):
  return np.mean(x)

df.aggregate([lambda x:  np.mean(x), nok_mean])
                   a                 b
   <lambda> nok_mean <lambda> nok_mean
0   0.0      0.0     1.0     1.0
1   3.0      3.0    -1.0    -1.0
2   6.0      6.0    -3.0    -3.0
3   9.0      9.0    -5.0    -5.0
4   12.0    12.0    -7.0    -7.0
...

Mixing aggregating and non-aggregating specs leads to an error:

df.aggregate(['mean', nok_mean])
~/anaconda3/envs/tsa37_jup/lib/python3.7/site-packages/pandas/core/base.py in _aggregate_multiple_funcs(self, arg, _level, _axis)
    607         # if we are empty
    608         if not len(results):
--> 609             raise ValueError("no results")
    610 

While using the aggregating function directly (not in list) gives the expected result:

df.aggregate(nok_mean)
a    13.5
b    -8.0
dtype: float64

Is this a bug, or am I missing something in the way I define aggregation functions? In my real project, I'm using more complex aggregation functions (such as a percentile one). So my question is:

How do I specify a custom aggregating function in order to work around this bug?

Note that using the custom aggregating function over a rolling, expanding or group-by window gives the expected result:

df.expanding().aggregate(['mean', nok_mean])
## returns cumulative aggregation results as expected

Pandas version: 0.23.4

plankthom

2 Answers

1

I found that making the aggregating function fail when it is called with a non-Series argument is a workaround:

def ok_mean(x):
  return np.mean(x.values)

def ok_mean2(x):
  if not isinstance(x,pd.Series):
    raise ValueError('need Series argument')
  return np.mean(x)

df.aggregate(['mean', ok_mean, ok_mean2])

It seems that in this situation (a list argument to pandas.DataFrame.aggregate), pandas first tries to apply the aggregating function to each individual data point. Only once that fails does it fall back to the correct behaviour, calling the function with the Series to be aggregated.
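One way to check this hypothesis on a given pandas version is to record the argument types the function actually receives. The following is only a sketch: `seen_types` and `probing_mean` are names made up for illustration, and the recorded types may well differ between pandas releases.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [3 * x for x in range(10)],
                   "b": [1 - 2 * x for x in range(10)]})

seen_types = []  # hypothetical log of the argument types pandas passes in

def probing_mean(x):
    # Record the type of the argument, then behave exactly like nok_mean.
    seen_types.append(type(x).__name__)
    return np.mean(x)

result = df.aggregate([probing_mean])

# Scalar type names in this list would indicate element-wise calls;
# 'Series' would indicate that the whole column was passed in.
print(sorted(set(seen_types)))
print(result.shape)
```

If only scalar types show up, the function never raised on a single value, so pandas had no reason to fall back to the Series call.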

Using a decorator to force Series arguments:

def assert_argtype(clazz):
    def wrapping(f):
        def wrapper(s):
            if not isinstance(s,clazz):
                raise ValueError('needs %s argument' % clazz)
            return f(s)
        return wrapper
    return wrapping

@assert_argtype(pd.Series)
def nok_mean(x):
    return np.mean(x)

df.aggregate([nok_mean])
## OK now, decorator fixed it!
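A possible refinement, sketched here under the assumption that pandas labels each result row using the function's `__name__`: applying `functools.wraps` inside the decorator keeps the original name, so the row would read `nok_mean` rather than `wrapper`. The helper name `series_only` is hypothetical.

```python
import functools

import numpy as np
import pandas as pd

def series_only(f):
    """Hypothetical helper: reject non-Series arguments and keep f's name."""
    @functools.wraps(f)  # copies __name__ and __doc__ from f onto wrapper
    def wrapper(s):
        if not isinstance(s, pd.Series):
            raise ValueError("needs a pandas.Series argument, got %s"
                             % type(s).__name__)
        return f(s)
    return wrapper

@series_only
def nok_mean(x):
    return np.mean(x)

print(nok_mean.__name__)                      # name of the wrapped function
print(nok_mean(pd.Series([1.0, 2.0, 3.0])))   # called directly with a Series
```

The guard still raises for scalar input, which is what triggers the fallback described above.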
plankthom
  • See also [this github issue](https://github.com/pandas-dev/pandas/issues/19756#issuecomment-467807736) which I think is related – plankthom Feb 27 '19 at 10:50
0

Based on the answers to the question "Pandas - DataFrame aggregate behaving oddly":

The cause appears to be that np.mean ends up being called on individual values rather than on each whole Series in the dataframe. Changing the function to

def nok_mean(x):
    return x.mean()

now allows you to apply multiple functions:

df.agg(['mean', nok_mean])

Returns

             a    b
mean      13.5 -8.0
nok_mean  13.5 -8.0
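The same idea may carry over to more complex aggregations such as the percentile mentioned in the question. A minimal sketch, assuming a 90th percentile is wanted (the function name `q90` and the 0.9 level are chosen only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [3 * x for x in range(10)],
                   "b": [1 - 2 * x for x in range(10)]})

def q90(x):
    # Use the Series method rather than np.percentile: plain scalars have
    # no .quantile attribute, so a call on a single value cannot succeed.
    return x.quantile(0.9)

print(df.agg(['mean', q90]))
```

Like x.mean(), this relies on a method that is only available on the Series, so it cannot silently succeed on a single value.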
Ian Thompson
  • I refined the question to clarify the core issue: how do I specify a custom function that works correctly in this context? My example indicated that I know that the standard `pandas.DataFrame.mean` behaves correctly. But why does my function fail here, given that in general it does work as an aggregation (as a non-list argument, and in windowing contexts)? In reality, I need more complex aggregation functions than mean. – plankthom Feb 27 '19 at 08:18
  • Could you give an example of the "more complex aggregation" you are trying to perform, along with some starting data and the expected output? – Ian Thompson Feb 27 '19 at 19:51