
I want to do a rolling computation on data that contains missing values.

Sample code (for simplicity this example is a rolling sum, but I want to do something more generic):

import numpy as np
import pandas

foo = lambda z: z[pandas.notnull(z)].sum()
x = np.arange(10, dtype="float")
x[6] = np.nan
x2 = pandas.Series(x)
pandas.rolling_apply(x2, 3, foo)

which produces:

0   NaN    
1   NaN
2     3    
3     6    
4     9    
5    12    
6   NaN    
7   NaN    
8   NaN    
9    24

I think that during the rolling computation, any window that contains missing data is skipped entirely. I'm looking to get a result along the lines of:

0   NaN    
1   NaN    
2     3    
3     6    
4     9    
5    12    
6     9    
7    12    
8    15    
9    24
Matthew Lundberg
Mahesh
    I think a partial answer to this question is the min_periods keyword argument of the rolling apply function. For example, pandas.rolling_apply(x2, 3, foo, min_periods=1) helps. – Mahesh Nov 15 '12 at 21:32

2 Answers

In [7]: pandas.rolling_apply(x2, 3, foo, min_periods=2)
Out[7]: 
0   NaN
1     1
2     3
3     6
4     9
5    12
6     9
7    12
8    15
9    24
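
In newer pandas releases the top-level rolling_apply function is no longer available. A minimal sketch of the same idea, assuming the Series.rolling API and using np.nansum as a stand-in for foo, might look like this:

```python
import numpy as np
import pandas as pd

# Same data as in the question: 0..9 with one missing value.
x = np.arange(10, dtype="float")
x[6] = np.nan
x2 = pd.Series(x)

def nan_sum(window):
    # Sum the window while skipping NaN entries (stand-in for foo).
    return np.nansum(window)

# min_periods=2: a window yields a value once it holds at least
# two non-missing observations; otherwise the result is NaN.
result = x2.rolling(window=3, min_periods=2).apply(nan_sum, raw=True)
print(result)
```

With raw=True the function receives a plain NumPy array for each window, so any NaN-aware reduction can be swapped in for the sum.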
Chang She
    For those wondering, this is from the [docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.rolling_apply.html): ``min_periods: Minimum number of observations in window required to have a value (otherwise result is NA).`` – Noah Dec 16 '15 at 22:29
    Any corrections to or suggestions on my general answer would be appreciated. The OP is long gone, but the well-worded question remains. –  Mar 08 '22 at 21:25

It would be better to replace the NA values in the dataset with sensible substitutes before operating on them.


For Numerical Data:

For your example, a simple mean of the values around the NA would fill it perfectly, but what if x[7] were missing as well?

The surrounding data follow a linear pattern, so linear interpolation (lerp) is in order.

The same goes for polynomial, exponential, logarithmic, and periodic (cosine) data: fill the gap using the kind of curve that fits the surrounding points.
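
As a rough sketch of the linear case, assuming a pandas version that provides Series.interpolate and Series.rolling, the gap can be filled before the rolling computation runs. Here both x[6] and x[7] are set to missing to mirror the scenario above:

```python
import numpy as np
import pandas as pd

# Data from the question, with x[7] removed as well.
x = np.arange(10, dtype="float")
x[6] = np.nan
x[7] = np.nan
x2 = pd.Series(x)

# Fill the gap by linear interpolation between the known neighbours.
filled = x2.interpolate(method="linear")

# With no missing values left, an ordinary rolling sum works.
result = filled.rolling(window=3).sum()
print(result)
```

For data that are not linear, a different method argument or a fitted model would be needed; which one is appropriate depends on the data.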

If an inflection point (a change in the sign of the second derivative of the data; take pairwise differences twice and note whether the sign changes) falls inside the missing stretch, its position is unknowable unless the data on the other side pin it down exactly. If they do not, pick a random point and continue.
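
The "subtract pairwise points twice" check can be sketched with NumPy as follows. The helper name count_inflections is made up for illustration, and the sketch assumes evenly spaced samples with no missing values in the stretch being examined:

```python
import numpy as np

def count_inflections(y):
    # Hypothetical helper: second pairwise difference approximates
    # the second derivative for evenly spaced samples.
    d2 = np.diff(y, n=2)
    signs = np.sign(d2)
    # Ignore zero entries (flat second difference) before comparing.
    signs = signs[signs != 0]
    # Count positions where consecutive signs differ.
    return int(np.count_nonzero(signs[1:] != signs[:-1]))

t = np.arange(-3.0, 4.0)
print(count_inflections(t))       # linear data
print(count_inflections(t ** 2))  # quadratic data
print(count_inflections(t ** 3))  # cubic data, bends at t = 0
```

A non-zero count on the known data around a gap suggests the curvature changes nearby, which is when simple interpolation across the gap is least trustworthy.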


For Categorical Data:

Use:

from scipy import stats

x = pandas.rolling_apply(x2, 3, lambda w: np.ravel(stats.mode(w, nan_policy='omit').mode)[0], min_periods=1)

to compute the most common value in each 3-element window, which can then replace the missing values.
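
A minimal sketch of this idea for newer pandas, assuming the categories are encoded as numeric codes (rolling apply works on numeric data) and swapping in np.unique for scipy.stats.mode so the example does not depend on a particular SciPy version. The helper name window_mode is made up for illustration:

```python
import numpy as np
import pandas as pd

def window_mode(window):
    # Hypothetical helper: most frequent non-missing value in the window.
    values = window[~np.isnan(window)]
    if values.size == 0:
        return np.nan
    uniques, counts = np.unique(values, return_counts=True)
    # On a tie, argmax picks the smallest value.
    return uniques[np.argmax(counts)]

# Category codes with one missing entry.
codes = pd.Series([1.0, 2.0, 2.0, np.nan, 3.0, 3.0])

# min_periods=1 so that windows containing a NaN still yield a value.
rolling_mode = codes.rolling(window=3, min_periods=1).apply(window_mode, raw=True)

# Only the missing positions take the rolling mode.
filled = codes.fillna(rolling_mode)
print(filled)
```

Note that the window here is trailing (the current element and the two before it), so the fill value comes from preceding observations only.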


For Static Data:

Use fillna, replacing 0 with the appropriate value:

x = x2.fillna(0)
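
For the rolling sum in the question, a fill value of 0 happens to give the result the asker wanted. A short sketch, assuming a pandas version with the Series.rolling API:

```python
import numpy as np
import pandas as pd

# Data from the question.
x = np.arange(10, dtype="float")
x[6] = np.nan
x2 = pd.Series(x)

# Replace the missing value with a constant, then roll as usual.
filled = x2.fillna(0)
result = filled.rolling(window=3).sum()
print(result)
```

This works for a sum because 0 is neutral under addition; for other computations (a mean or a product, for instance) a constant fill changes the result, so the fill value has to be chosen per operation.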