I have a function that I wish to apply to subsets of a pandas DataFrame, so that the function is calculated on all rows (up to and including the current row) from the same group, i.e. using a groupby and then expanding.

For example, this dataframe:

import pandas as pd

df = pd.DataFrame.from_dict(
    {
        'group': ['A','A','A','B','B','B'],
        'time': [1,2,3,1,2,3],
        'x1': [10,40,30,100,200,300],
        'x2': [1,2,1,2,0,3]
    }).sort_values('time')

i.e.

    group   time    x1      x2
0   A       1       10      1
3   B       1       100     2
1   A       2       40      2
4   B       2       200     0
2   A       3       30      1
5   B       3       300     3

and this function, for example:

def foo(_df):
    return _df['x1'].max() * _df['x2'].iloc[-1]

[Edited for clarity following feedback from jezrael: my actual function is more complicated and cannot easily be broken down into components for this task. This simple function is just for an MCVE.]

I want to do something like: df['foo_result'] = df.groupby('group').expanding().apply(foo, raw=False)

To obtain this result:

    group   time    x1  x2  foo_result
0   A       1       10  1   10
3   B       1       100 2   200
1   A       2       40  2   80
4   B       2       200 0   0
2   A       3       30  1   40
5   B       3       300 3   900

Problem is, running df.groupby('group').expanding().apply(foo, raw=False) results in KeyError: 'x1'.

Is there a correct way to run this, or is it not possible to do so in pandas without breaking down my function into components?

Itamar Mushkin

2 Answers

Applying a function that takes a whole DataFrame to an expanding window is apparently not possible (at least not in pandas 0.23.0; edit: nor in 1.3.0). You can see this by putting a print statement inside the function.

Running df.groupby('group').expanding().apply(lambda x: bool(print(x)) , raw=False) on the given DataFrame (where the bool around the print is just to get a valid return value) prints:

0    1.0
dtype: float64
0    1.0
1    2.0
dtype: float64
0    1.0
1    2.0
2    3.0
dtype: float64
0    10.0
dtype: float64
0    10.0
1    40.0
dtype: float64
0    10.0
1    40.0
2    30.0
dtype: float64

(and so on - and also returns a dataframe with '0.0' in each cell, of course).

This shows that the expanding window works on a column-by-column basis (we see that first the expanding time series is printed, then x1, and so on), and does not really work on a dataframe - so a dataframe function can't be applied to it.

So, to get the desired functionality, one would have to put the expanding inside the DataFrame function, as in the accepted answer.
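
If the function really cannot be decomposed, one fallback is an explicit loop that builds each expanding slice by hand and passes it to the function as a DataFrame. The sketch below is not taken from either answer: the helper name expanding_apply is made up for illustration, the x2 values follow the table shown in the question, and it assumes the rows of each group are already in the desired order (here, sorted by time). It calls the function once per row, so the cost grows quadratically with group size and it may be slow on large groups.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'time': [1, 2, 3, 1, 2, 3],
        'x1': [10, 40, 30, 100, 200, 300],
        'x2': [1, 2, 1, 2, 0, 3],
    }
).sort_values('time')


def foo(_df):
    # The question's example function: it needs two columns at once.
    return _df['x1'].max() * _df['x2'].iloc[-1]


def expanding_apply(group, func):
    # Hypothetical helper (not a pandas API): call func on rows 0..i
    # of the group, for every row position i, keeping the group's index.
    results = [func(group.iloc[:i + 1]) for i in range(len(group))]
    return pd.Series(results, index=group.index)


# One result Series per group; concat keeps the original row labels,
# so the assignment below aligns on the index.
parts = [expanding_apply(g, foo) for _, g in df.groupby('group')]
df['foo_result'] = pd.concat(parts)
print(df)
```

Because the result is aligned on the original index, it does not matter that the groups are processed one after another rather than in the DataFrame's row order.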

Itamar Mushkin

A possible solution is to make the expanding part of the function, and use GroupBy.apply:

def foo1(_df):
    return _df['x1'].expanding().max() * _df['x2'].expanding().apply(lambda x: x[-1], raw=True)

df['foo_result'] = df.groupby('group').apply(foo1).reset_index(level=0, drop=True)
print(df)
  group  time   x1  x2  foo_result
0     A     1   10   1        10.0
3     B     1  100   2       200.0
1     A     2   40   2        80.0
4     B     2  200   0         0.0
2     A     3   30   1        40.0
5     B     3  300   3       900.0

This is not a direct solution to the problem of applying a dataframe function to an expanding dataframe, but it achieves the same functionality.

Itamar Mushkin
jezrael
Instead of the right side of multiplication- you can do: ```s = g['x1'].expanding().max() // df['foo_result'] = s.reset_index(level=0, drop=True)*df['x2']``` – Grzegorz Skibinski Jan 19 '20 at 13:05

Thank you for your help, but this function was just something I made up for a minimal, reproducible example; Breaking down my actual function to its components this way is not what I need – Itamar Mushkin Jan 19 '20 at 14:25

@ItamarMushkin hmmm, I try answer for `Problem is, functions on .expanding() don't work on entire dataframe, only per column... So, what can I do instead?` – jezrael Jan 19 '20 at 14:28

I see... then my question was not clear enough. I've edited it following your feedback. – Itamar Mushkin Jan 19 '20 at 14:35