
I asked this question before ("python pandas: applying different aggregate functions to different columns"), but the latest changes to pandas (https://github.com/pandas-dev/pandas/pull/15931) mean that what I thought was an elegant and pythonic solution is now deprecated, for reasons I genuinely fail to understand.

The question was, and still is: when doing a groupby, how can I apply different aggregate functions to different fields (e.g. sum of x, avg of x, min of y, max of z, etc.) and rename the resulting fields, all in one go, or at least in a possibly pythonic and not-too-cumbersome way? I.e. sum_x won't do, I need to rename the fields explicitly.

This approach, which I liked:

df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
                       "unemp": {"mean_unemp": "mean"}})

is deprecated and now produces this warning:

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version

Thanks!

Pythonista anonymous
  • 1
    you got an answer here https://stackoverflow.com/questions/44635626/pandas-aggregation-warning-futurewarning-using-a-dict-with-renaming-is-depreca – BENY Oct 11 '17 at 17:31
  • 3
    But, as @ErnestScribbler commented on that answer, that doesn't take care of the renaming. I suppose it has to be done manually? With large dataframes with lots of columns, this means that not only do I have to replace my old code, but that the new code is way longer. All of this why??? – Pythonista anonymous Oct 11 '17 at 17:46
  • 1
    I too struggle to understand why this was done. It feels so incredibly unpythonic and gets really cumbersome really quickly, especially if I do not know how the new columns will actually be named. Maybe opening yet another thread on github about this will help? It just feels like bad design :-( – Thomas Mar 27 '19 at 13:35
  • 1
    Frustratingly, I feel compelled to use PySpark even when it is not necessary, simply because I like the syntax so much more: df.groupby("col1").agg(F.mean("col2").alias("myaggcolumn"), F.max("col3").alias("mymaxcolumn")). It is immediately clear what the column names will be, no matter what the aggregation function spits out. I can comment single lines in or out without having to change anything else – Thomas Mar 27 '19 at 13:37

1 Answer


agg() itself is not deprecated, but renaming through agg() with a nested dict is.

Do go through the documentation: https://pandas.pydata.org/pandas-docs/stable/whatsnew.html#deprecate-groupby-agg-with-a-dictionary-when-renaming

What is deprecated:
1. Passing a dict to a grouped/rolled/resampled Series that allowed one to rename the resulting aggregation.
2. Passing a dict-of-dicts to a grouped/rolled/resampled DataFrame.

This will work, though it's not a single line of code:

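# Keep the aggregated result; the original df is left untouched
result = df.groupby('qtr').agg({"realgdp": ["mean", "std"], "unemp": "mean"})

# Flatten the MultiIndex columns, e.g. ('realgdp', 'mean') -> 'realgdp_mean'
result.columns = result.columns.map('_'.join)

result.rename(columns={'realgdp_mean': 'mean_gdp', 'realgdp_std': 'std_gdp', 'unemp_mean': 'mean_unemp'}, inplace=True)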
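
If you can use pandas 0.25 or later (newer than the release this thread is about), named aggregation may be a tidier option: each keyword becomes the output column name and maps to an (input column, function) pair. The following is only a sketch; the sample data is made up for illustration, and only the column names come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "qtr": [1, 1, 2, 2],
    "realgdp": [100.0, 110.0, 120.0, 140.0],
    "unemp": [5.0, 6.0, 4.0, 5.0],
})

# Named aggregation: output_name=(input_column, aggregation_function).
# Assumes pandas >= 0.25.
result = df.groupby("qtr").agg(
    mean_gdp=("realgdp", "mean"),
    std_gdp=("realgdp", "std"),
    mean_unemp=("unemp", "mean"),
)
print(result)
```

The output columns are named directly in the call, so no flattening or rename step is needed afterwards.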
Vaishali
  • 1
    I would have thought of r.columns = [' '.join(col).strip() for col in r.columns.values] but your line is fewer characters! Thanks for the clarification. I still struggle to understand why on earth this is being deprecated. Removing backward compatibility should be a last resort. Changing all existing code is a huge pain. I see the downsides, I do not see a single upside! – Pythonista anonymous Oct 11 '17 at 18:00
  • 1
    Actually, renaming is still a problem if I use more than one lambda function on the same column (e.g. to calculate % of sum and % of count), because then I'd end up with two columns with the same name, two x_lambda – Pythonista anonymous Oct 12 '17 at 12:16
  • Zetrin's comment on 12-Oct-2017 puts it way more eloquently than I could have: https://github.com/pandas-dev/pandas/pull/15931 – Pythonista anonymous Oct 12 '17 at 16:09
  • Yes but the solution remains the same, use agg and then combine multiindex columns – Vaishali Oct 12 '17 at 16:14
  • I'm not following how this solves the issue with lambda functions. If I have two lambda functions on column x, I end up with two columns with the same name. – Pythonista anonymous Oct 12 '17 at 21:50
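
One possible way around the duplicate-name problem with lambdas, while staying with the "agg, then combine the MultiIndex columns" approach from the answer, is to use named functions instead: pandas labels each result column with the function's `__name__`, so the flattened names stay distinct. This is a sketch under assumptions; the data and the helper names `pct_of_sum` and `pct_of_count` are hypothetical, chosen to mirror the "% of sum" and "% of count" case mentioned in the comments.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1.0, 3.0, 2.0, 4.0],
})

total_sum = df["x"].sum()
total_count = len(df)

# Named functions (hypothetical helpers) instead of two anonymous lambdas.
def pct_of_sum(s):
    """Group sum as a percentage of the overall sum."""
    return 100 * s.sum() / total_sum

def pct_of_count(s):
    """Group row count as a percentage of the overall row count."""
    return 100 * s.count() / total_count

result = df.groupby("key").agg({"x": [pct_of_sum, pct_of_count]})

# Flatten ('x', 'pct_of_sum') -> 'x_pct_of_sum', and so on.
result.columns = result.columns.map("_".join)
print(result)
```

Because each function carries its own name, the two columns no longer collide, and they can still be renamed afterwards with `rename` if different labels are wanted.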