2

I'm having trouble using .apply or .aggregate in pandas on a rolling basis (assuming, of course, that this is the right way to solve my problem). Let's assume I have a dataframe with two columns, A and B. I would like to create a column C that contains the rolling mean of B, taken only over the rows where A equals 1. More generally, I would like to be able to apply a custom function on a rolling basis with conditions involving several columns of the dataframe (e.g. the rolling sum of column A when B > x and/or C = y, etc.).

import pandas as pd
import numpy as np
df2 = pd.DataFrame({'A':[1,1,1,0,0,0,1,1,1],'B': [50,40,50,-20,20,10,10,-5,-2]}, index = np.arange(9))

Desired output would be (assuming a rolling window of 3):

df2 = pd.DataFrame({'A': [1,1,1,0,0,0,1,1,1], 'B': [50,40,50,-20,20,10,10,-5,-2],
                    'C': [np.nan, np.nan, 46.67, 45, 50, np.nan, 10, 2.50, 1]}, index=np.arange(9))

I have tried to define a function mean_1 as follows:

def mean_1(x):
    return np.where(x['A'] == 1, np.mean(x['B']), np.nan)

df2['C'] = df2.rolling(3).apply(mean_1)

and got the error: 'Series' object has no attribute 'A'. I guess it is related to the raw=False parameter described in the documentation. Thanks.
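
A small probe may help check that guess. As far as I understand, DataFrame.rolling(...).apply calls the function once per column and per window, so the function only ever receives one column's values (a Series with raw=False, a NumPy array with raw=True) and never a slice containing both A and B. The make_probe and probe names below are made up for illustration:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'A': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                    'B': [50, 40, 50, -20, 20, 10, 10, -5, -2]},
                   index=np.arange(9))

def make_probe(seen):
    """Return a function that records the type of every window it receives."""
    def probe(window):
        seen.add(type(window).__name__)
        return window.mean()
    return probe

seen_series, seen_raw = set(), set()
df2.rolling(3).apply(make_probe(seen_series), raw=False)
df2.rolling(3).apply(make_probe(seen_raw), raw=True)

print(seen_series)  # {'Series'}
print(seen_raw)     # {'ndarray'}
```

If that holds, switching raw on or off would not help here, because neither form gives the function access to the other column.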

CTXR

3 Answers

1

Here is a way to get close to your desired output.

df2['C'] = df2.apply(lambda row: np.where(row['A']==1, row['B'], np.nan), axis=1).rolling(3, min_periods=1).apply(np.nanmean)

The difference is that the above gives values for indices 0 and 1 as well, because of min_periods=1.

Nabil Daoud
  • Thank you for your answer. That is an interesting solution and having a value for index 0 and 1... I can live with that for now – CTXR Nov 27 '20 at 20:37
  • @CTXR - check the 2 other solutions. I don't know the size of your data, but I would speculate that this approach will be significantly slower than the others: a single `.apply(...)` is already slow, and here you have 2 of them. Some reference: https://stackoverflow.com/a/54432584/11610186 – Grzegorz Skibinski Nov 27 '20 at 20:51
1

You can first mask the 'B' values where 'A' is not 1, then apply the rolling method:

mask_map = df2.A != 1
df2['C'] = df2.B.mask(mask_map).rolling(3, min_periods=1).mean().round(2)

Output:

   A   B      C
0  1  50  50.00
1  1  40  45.00
2  1  50  46.67
3  0 -20  45.00
4  0  20  50.00
5  0  10    NaN
6  1  10  10.00
7  1  -5   2.50
8  1  -2   1.00

Note that the first values are not NaN because we specified min_periods=1, which means the mean is taken regardless of how many values in the window are missing (as long as at least one is present). If you really want the first values to be NaN, this can be done with:

df2.iloc[:n-1, df2.columns.get_loc('C')] = np.nan

where n is the window size (3 in this case). This will return the exact desired output.
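
Putting the pieces together, a minimal end-to-end sketch could look like this. It assumes the question's df2 and a window of n = 3; the masked_b and expected names are only there for illustration:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {'A': [1, 1, 1, 0, 0, 0, 1, 1, 1],
     'B': [50, 40, 50, -20, 20, 10, 10, -5, -2]},
    index=np.arange(9),
)

n = 3  # rolling window size

# Hide B wherever A is not 1, then average whatever remains in each window.
masked_b = df2['B'].mask(df2['A'] != 1)
df2['C'] = masked_b.rolling(n, min_periods=1).mean().round(2)

# Blank the first n-1 rows, whose windows are incomplete.
df2.iloc[:n - 1, df2.columns.get_loc('C')] = np.nan

# Column C as given in the question.
expected = [np.nan, np.nan, 46.67, 45.0, 50.0, np.nan, 10.0, 2.5, 1.0]
print(np.allclose(df2['C'], expected, equal_nan=True))
```

Index 5 stays NaN without any extra step, because every value in its window is masked.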

Best!

Cainã Max Couto-Silva
  • Just a note - I would generally trust that OP didn't mean to use `min_periods=1`, as it changes the result in a meaningful way here (your output differs from what OP requested). You can post it as a remark, after the answer matching the requirement, but don't base your answer on sheer speculation (a long shot - I would say - in this case) – Grzegorz Skibinski Nov 27 '20 at 20:56
  • I've mistakenly masked the output as well. Updated answer, thx! In fact, I assumed that the OP requires `min_periods=1` because the result for index 6 is 10, and there's only NaN in that window other than its own value. But what do you mean by the OP doesn't want `min_periods=1`? Your answer does that (great answer, btw), but manually, since you're taking the mean regardless of `NaN`. – Cainã Max Couto-Silva Nov 27 '20 at 21:31
1

You can vectorize your solution:

df2['C'] = df2['A'].eq(1).mul(df2['B']).rolling(3).sum()\
    .div(df2['A'].eq(1).rolling(3).sum())\
    .round(2)

If you are asking more generally, in terms of any function, my advice would be: always try to vectorize, and generally avoid .apply(...).
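
For the more general case from the question (a rolling aggregate restricted by conditions on several columns), one possible sketch is to build a boolean condition first and keep everything vectorized. Note that rolling_where is a hypothetical helper name, not a pandas function, and the second condition (B > 0) is an arbitrary example:

```python
import numpy as np
import pandas as pd

def rolling_where(values, condition, window, func='mean'):
    """Rolling aggregate of `values` over rows where `condition` is True.

    Hypothetical helper, not part of pandas. `func` names a rolling method
    such as 'mean' or 'sum'. Windows with no qualifying row give NaN, and
    so do the first window-1 rows, whose windows are incomplete.
    """
    kept = values.where(condition)              # non-qualifying rows become NaN
    roll = kept.rolling(window, min_periods=1)  # aggregate whatever is left
    out = getattr(roll, func)()
    out.iloc[:window - 1] = np.nan              # require a full window of rows
    return out

df = pd.DataFrame({'A': [1, 1, 1, 0, 0, 0, 1, 1, 1],
                   'B': [50, 40, 50, -20, 20, 10, 10, -5, -2]})

# Rolling mean of B where A == 1 (the case from the question).
df['C'] = rolling_where(df['B'], df['A'].eq(1), 3).round(2)

# Rolling sum of B where A == 1 and B > 0 (made-up second condition).
df['D'] = rolling_where(df['B'], df['A'].eq(1) & df['B'].gt(0), 3, func='sum')

print(df)
```

Conditions combine with & and |, so adding further columns only changes the mask, not the rolling step.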

Grzegorz Skibinski