1

Using Pandas 1.1.5, I have a test DataFrame like the following:

import numpy as np
import pandas as pd
df = pd.DataFrame({'id': ['a0','a0','a0','a1','a1','a1','a2','a2'],
                   'a': [4,5,6,1,2,3,7,9],
                   'b': [3,4,5,3,2,4,1,3],
                   'c': [7,4,3,8,9,7,4,6],
                   'denom_a': [7,8,9,7,8,9,7,8],
                   'denom_b': [10,11,12,10,11,12,10,11]})

I would like to apply the following custom aggregate function on a rolling window where the function's calculation depends on the column name as so:

def custom_func(s, df, colname):
  if 'a' in colname:
    denom = df.loc[s.index, "denom_a"]
    calc = s.sum() / np.max(denom)
  elif 'b' in colname:
    denom = df.loc[s.index, "denom_b"]
    calc = s.sum() / np.max(denom)
  else:
    calc = s.mean()
  return calc

df.groupby('id')\
  .rolling(2, 1)\
  .apply(lambda x: custom_func(x, df, x.name))

This results in TypeError: argument of type 'NoneType' is not iterable because the windowed subsets of each column do not retain the names of the original df columns. That is, x.name being passed in as an argument is in fact passing None rather than a string of the original column name.

Is there some way of making this approach work (say, retaining the column name being acted on with apply and passing that into the function)? Or are there any suggestions for altering it? I consulted the following reference for having the custom function utilize multiple columns within the same window calculation, among others:

https://stackoverflow.com/a/57601839/6464695

chris_lee
  • 27
  • 8
  • So, copying and pasting what you have here, I actually get a `DataError: No numeric types to aggregate`. (Just pointing this out.) Additionally, the `.groupby` here isn't helpful, as we'll just get the three rows as individual groups. Could you flesh out your example `df` a bit more? – John Jul 21 '21 at 23:01
  • @John good point RE: the `.groupby`, I've updated the df accordingly to be a bit more helpful. As for the `DataError`, I'm not able to reproduce it on my end. – chris_lee Jul 21 '21 at 23:07
  • `if 'a' in colname` --> `if colname == 'a'` (and so on)? – Brendan Jul 21 '21 at 23:18
  • 1
    @Brendan unfortunately not, it's intentionally looking up a substring within the string that is the current colname (this sample data is a simpler, dummy version of my real df). It might normally be something like `if 'per_hour' in colname`, etc. – chris_lee Jul 21 '21 at 23:52

1 Answers1

1

I wouldn't be surprised if there's a "better" solution, but I think could at least be a "good start" (I don't do a whole lot with .rolling(...)).

With this solution, I make two critical assumptions:

  1. All denom_<X> have a corresponding <X> column.
  2. Everything you do with the (<X>, denom_<X>) pairs is the same. (This should be straightforward to customize as needed.)

With that said, I do the .rolling within the function, rather than outside, in part because it seems like .apply(...) on a RollingGroupBy can only work column-wise, which isn't too helpful here (imo).

def cust_fn(df: pd.DataFrame, rolling_args: Tuple) -> pd.Series:
    cols = df.columns
    denom_cols = ["id"]  # the whole dataframe is passed, so place identifiers / uncomputable variables here.

    for denom_col in cols[cols.str.startswith("denom_")]:
        denom_cols += [denom_col, denom_col.replace("denom_", "")]
        col = denom_cols[-1]  # sugar
        df[f"calc_{col}"] = df[col].rolling(*rolling_args).sum() / df[denom_col].max()

    for col in cols[~cols.isin(denom_cols)]:
        print(col, df[col])
        df[f"calc_{col}"] = df[col].rolling(*rolling_args).mean()
    
    return df

Then the way you'd go about running this is the following (and you get the corresponding output):

>>> df.groupby("id").apply(cust_fn, rolling_args=(2, 1))
   id  a  b  c  denom_a  denom_b    calc_a    calc_b  calc_c
0  a0  4  3  7        7       10  0.444444  0.250000     7.0
1  a0  5  4  4        8       11  1.000000  0.583333     5.5
2  a0  6  5  3        9       12  1.222222  0.750000     3.5
3  a1  1  3  8        7       10  0.111111  0.250000     8.0
4  a1  2  2  9        8       11  0.333333  0.416667     8.5
5  a1  3  4  7        9       12  0.555556  0.500000     8.0
6  a2  7  1  4        7       10  0.875000  0.090909     4.0
7  a2  9  3  6        8       11  2.000000  0.363636     5.0

If you need dynamically state which non-numeric/computable columns exist, then it might make sense to define cust_fn as follows:

def cust_fn(df: pd.DataFrame, rolling_args: Tuple, index_cols: List = []) -> pd.Series:
    cols = df.columns
    denon_cols = index_cols

    # ... the rest is unchanged

Then you would adapt your calling of cust_fn as follows:

>>> df.groupby("id").apply(cust_fn, rolling_args=(2, 1), index_cols=["id"])

Of course, comment on this if you run into issues adapting it to your uses.

John
  • 320
  • 2
  • 9
  • 1
    Thank you so much, John - this was incredibly helpful! Unfortunately the assumption that there exists a `denom_` for every `` does not hold for me (`hour` denom gets reused for multiple `per_hour` columns for example), but the key was pulling `rolling` into `custom_func` rather than calling it before `apply`, as well as using separate calls to `rolling` for the numerator and denominator. With those two changes, I was able to solve my problem - I will keep your answer as best answer though as it solves the problem as posted! Thanks again! – chris_lee Jul 22 '21 at 02:57
  • 1
    Ah! I figured my assumption might have been dangerous. Good to know that you were able to use it as inspiration, though! – John Jul 22 '21 at 03:04