
Currently I am able to retrieve an array of summary statistics from a large groupby object (for example, the groupby object has 2000 dataframes, and I retrieve the mean value of each dataframe's 'z' column).

To do this I use the following process:

vals = mygroupby.aggregate(np.mean)['z'].values

I am also able to do this with np.std, np.var, etc. However, I would like to do this with np.percentile (i.e. return an array of all the 90th percentiles in the groupby object), but this requires additional arguments. This is what I have tried:

vals = mygroupby.aggregate(np.percentile(90))['z'].values

This gives the following error:

TypeError: percentile() missing 1 required positional argument: 'q'

I understand this is because I am missing the iterable for np.percentile. How do I tell np.percentile that the iterable is the aggregate itself, similar to how np.mean works?
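To make the cause of the error concrete, here is a minimal sketch that needs no groupby at all. It assumes only that NumPy is installed; the variable names `raised`, `message`, and `is_callable` are illustrative. The point is that `np.percentile(90)` is called immediately, before `aggregate` ever sees it, whereas `np.mean` is passed along as an uncalled function.

```python
import numpy as np

# np.percentile(90) runs right away: 90 is bound to the data argument,
# so the required percentile argument `q` is missing and the call fails
# before aggregate is ever reached.
try:
    np.percentile(90)
    raised = False
    message = ""
except TypeError as exc:
    raised = True
    message = str(exc)

# np.mean without parentheses is only a function object; nothing has been
# called yet, which is why aggregate can call it later on each group.
is_callable = callable(np.mean)

print(raised, message)
print(is_callable)
```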

Edit

Performance is a concern here, and using lambda functions within the argument slows things down drastically, whereas the np.mean example executes very quickly.

Bryce Frank
  • http://stackoverflow.com/questions/26354329/python-pandas-passing-multiple-functions-to-agg-with-arguments – plasmon360 Mar 22 '17 at 21:15
  • You can pass any arguments for the input function to `aggregate`, which will in turn pass them on to that function, i.e. `mygroupby.aggregate(np.percentile, q=90)`. Also, I think you want to select the 'z' column before the `aggregate`, as `np.percentile` will flatten your input array (computing the percentile over all data, not per column), i.e. `mygroupby['z'].aggregate(np.percentile, q=90)`. If you only want to aggregate over the 'z' column you should do this for any function regardless, as it will be more efficient. – root Mar 22 '17 at 22:08
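The suggestion in the comment above can be sketched as follows. The data here is made up for illustration (a hypothetical `group` key column and random `z` values standing in for the real frame), so only the two `aggregate`/`quantile` lines carry over to the original problem. The sketch also shows pandas' built-in groupby `quantile` method as a possible alternative; whether it is fast enough for the 2000-group case is an assumption that has not been measured here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: a key column and a 'z' column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": rng.integers(0, 5, size=1000),
    "z": rng.normal(size=1000),
})
mygroupby = df.groupby("group")

# Select 'z' first, then let aggregate forward q=90 to np.percentile
# for each group (no lambda needed).
vals_agg = mygroupby["z"].aggregate(np.percentile, q=90).values

# Possible alternative: the groupby quantile method.
# Note that it takes a fraction (0.9), not a percentage (90).
vals_quantile = mygroupby["z"].quantile(0.9).values

print(vals_agg)
print(vals_quantile)
```

Both calls use linear interpolation by default, so the two arrays should agree up to floating-point error.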
