
I want to apply a weighted sum to a DataFrame. In the past I have used

for sec_id, sec_df_unidx in grouped:
    if sec_df_unidx.shape[0] > 3:
        pd.rolling_sum(sec_df_unidx[added_cols], 4)

Now I want a weighted average over the same 4-period window, where the most recent value is multiplied by 0.6, the 2nd most recent by 0.2, and the 3rd and 4th most recent by 0.1 each.

DF:

DATE    ID  VALUE
2012-12-31  A   100
2013-03-31  A   120
2013-06-30  A   140
2013-09-30  A   160
2013-12-31  A   180
2013-03-31  B   0
2013-06-30  B   5
2013-09-30  B   1
2013-12-31  B   3
2012-12-31  C   45
2013-03-31  C   46
2013-06-30  C   42
2013-09-30  C   30
2013-12-31  C   11
2012-12-31  D   18
2013-03-31  D   9
2013-06-30  D   13
2013-09-30  D   5
2013-12-31  D   11
2012-12-31  E   0

DF with new column:

DATE    ID  VALUE   Weight_Sum
2012-12-31  A   100 NaN
2013-03-31  A   120 NaN
2013-06-30  A   140 NaN
2013-09-30  A   160 146
2013-12-31  A   180 166
2013-03-31  B   0   NaN
2013-06-30  B   5   NaN
2013-09-30  B   1   NaN
2013-12-31  B   3   2.5
2012-12-31  C   45  NaN
2013-03-31  C   46  NaN
2013-06-30  C   42  NaN
2013-09-30  C   30  35.5
2013-12-31  C   11  21.4
2012-12-31  D   18  NaN
2013-03-31  D   9   NaN
2013-06-30  D   13  NaN
2013-09-30  D   5   8.3
2013-12-31  D   11  9.8
2012-12-31  E   0   NaN
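
For example, the first non-NaN value for ID A is 0.6*160 + 0.2*140 + 0.1*120 + 0.1*100 = 146, and the next is 0.6*180 + 0.2*160 + 0.1*140 + 0.1*120 = 166.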

Can I do this with a rolling_apply or rolling_sum? Or will I have to do a for loop?

Thank you.

user1911092

1 Answer


I think you can do it with a rolling_apply inside a function called by a normal groupby/apply. Something like the following:

import numpy as np
import pandas as pd

def roll_wsum(g, w, p):
    # rolling weighted sum over windows of length p; windows shorter than p give NaN
    rsum = pd.rolling_apply(g.values, p, lambda x: np.dot(w, x), min_periods=p)
    return pd.Series(rsum, index=g.index)

# weights are ordered oldest -> newest within each window
weights = np.array([0.1, 0.1, 0.2, 0.6])
df['wsum'] = df.groupby('ID')['VALUE'].apply(roll_wsum, weights, 4)
print df

Output:

         DATE ID  VALUE   wsum
0  2012-12-31  A    100    NaN
1  2013-03-31  A    120    NaN
2  2013-06-30  A    140    NaN
3  2013-09-30  A    160  146.0
4  2013-12-31  A    180  166.0
5  2013-03-31  B      0    NaN
6  2013-06-30  B      5    NaN
7  2013-09-30  B      1    NaN
8  2013-12-31  B      3    2.5
9  2012-12-31  C     45    NaN
10 2013-03-31  C     46    NaN
11 2013-06-30  C     42    NaN
12 2013-09-30  C     30   35.5
13 2013-12-31  C     11   21.4
14 2012-12-31  D     18    NaN
15 2013-03-31  D      9    NaN
16 2013-06-30  D     13    NaN
17 2013-09-30  D      5    8.3
18 2013-12-31  D     11    9.8
19 2012-12-31  E      0    NaN

So, I am just grouping the data by 'ID' and then sending each group's 'VALUE' column to my roll_wsum function (along with the weights for the weighted sum and the number of periods). The roll_wsum function calls rolling_apply and feeds it a simple lambda: the dot product of the current window of 'VALUE' with the weights. Also, it is critical here to impose the min_periods=p condition, because the window array passed to the lambda must be the same length as the weights array for the dot product to work.
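
Note that the pd.rolling_* functions were later deprecated and removed from pandas in favor of the .rolling() method. A rough equivalent in a current pandas version (untested sketch, assuming the same df and weights) would be:

import numpy as np
import pandas as pd

weights = np.array([0.1, 0.1, 0.2, 0.6])  # ordered oldest -> newest within each window

# rolling(4, min_periods=4) only hands full 4-row windows to the lambda;
# shorter windows come back as NaN, matching the behaviour above
df['wsum'] = (
    df.groupby('ID')['VALUE']
      .transform(lambda s: s.rolling(4, min_periods=4)
                            .apply(lambda x: np.dot(weights, x), raw=True))
)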

Since I use the dot product to compute the weighted sum, it may not handle missing values the way you want: a single NaN anywhere in the window makes the whole result NaN. So, for example, you may prefer the following, which uses np.nansum to skip missing values (although it doesn't make a difference for the example data):

def roll_wsum(g, w, p):
    # np.nansum treats any NaN in the window as 0 instead of propagating it
    rsum = pd.rolling_apply(g.values, p, lambda x: np.nansum(w * x), min_periods=p)
    return pd.Series(rsum, index=g.index)

weights = np.array([0.1, 0.1, 0.2, 0.6])
df['wsum'] = df.groupby('ID')['VALUE'].apply(roll_wsum, weights, 4)
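
To see the difference, consider a made-up window containing one missing value:

import numpy as np

w = np.array([0.1, 0.1, 0.2, 0.6])
x = np.array([100.0, np.nan, 140.0, 160.0])  # hypothetical window with a gap

np.dot(w, x)      # nan  -- a single NaN makes the whole result NaN
np.nansum(w * x)  # 134.0 -- the NaN term is dropped (treated as 0)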
Karl D.