I want to compute a statistic over a rolling window, where the statistic depends on several columns at once. Here is a toy example that calculates regression coefficients over a rolling window.
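
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    def regression_coef(df):
        # An empty window has nothing to fit, so return NaN for both coefficients
        if df.shape[0] == 0:
            return np.array([np.nan, np.nan])
        y = df.y.values
        X = df.drop('y', axis=1).values
        # Regress y on the remaining columns and return the rounded slopes
        return LinearRegression().fit(X, y).coef_.round(2)

    # Toy data: one observation every 5 seconds for one hour
    time = np.arange(5, 3605, 5)
    x = np.random.normal(size=time.size)
    z = np.random.normal(size=time.size)
    y = 2*x + z + np.random.normal(size=time.size)
    df = pd.DataFrame({'x': x, 'z': z, 'y': y}, index=pd.to_datetime(time, unit='s'))
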
When I call df.rolling('20 T').apply(regression_coef), I get the following error: AttributeError: 'numpy.ndarray' object has no attribute 'y'. This leads me to believe that df.rolling computes statistics over each column individually, rather than passing all observations within the 20-minute window to the function.
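
A small probe seems to support that reading. This is only a sketch: probe, seen and demo are hypothetical names, and raw=True is an assumption chosen to match the ndarray in the error message (with raw=False the function receives a one-column Series instead). The window is written as '20min', which should be equivalent to '20 T'.

```python
import numpy as np
import pandas as pd

# Small frame on the same 5-second grid as the toy data
rng = np.random.default_rng(0)
idx = pd.to_datetime(np.arange(5, 3605, 5), unit="s")
demo = pd.DataFrame(
    {"x": rng.normal(size=idx.size), "y": rng.normal(size=idx.size)},
    index=idx,
)

seen = []  # records what the rolling machinery passes to the function

def probe(window):
    # Note the type and dimensionality of each window, then return a dummy scalar
    seen.append((type(window).__name__, window.ndim))
    return 0.0

out = demo.rolling("20min").apply(probe, raw=True)

# Distinct (type, ndim) pairs seen across all calls
print(set(seen))
```

Every call receives a one-dimensional array holding a single column's values, so the function never sees the other columns of the window.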
How can I achieve what I want? That is to say, how can I compute regression_coef over a rolling window so that each call sees every column? In particular, I'm interested in whether this can be solved with offset-based windows and the existing pandas API.
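
For reference, the output I'm after can be sketched with an explicit loop. This is a slow baseline under stated assumptions, not a proposed solution: rolling_ols is a hypothetical helper, np.linalg.lstsq with an intercept column stands in for LinearRegression, rounding is omitted, and windows with too few rows to fit are set to NaN (my own choice, not something regression_coef does).

```python
import numpy as np
import pandas as pd

def rolling_ols(frame, window="20min", target="y"):
    """Hypothetical helper: OLS slopes over the trailing window ending at each row."""
    width = pd.Timedelta(window)
    features = frame.columns.drop(target)
    rows = []
    for t in frame.index:
        # Same bounds as rolling's default closed='right': (t - width, t]
        chunk = frame.loc[(frame.index > t - width) & (frame.index <= t)]
        if len(chunk) <= len(features):
            # Assumption: not enough rows for intercept plus slopes, so emit NaN
            rows.append([np.nan] * len(features))
            continue
        # Intercept column plus features; keep only the slope coefficients
        X = np.column_stack([np.ones(len(chunk)), chunk[features].to_numpy()])
        beta, *_ = np.linalg.lstsq(X, chunk[target].to_numpy(), rcond=None)
        rows.append(beta[1:])
    return pd.DataFrame(np.asarray(rows, dtype=float), index=frame.index, columns=features)

# Same construction as the toy data above, with a fixed seed
rng = np.random.default_rng(0)
time = np.arange(5, 3605, 5)
x = rng.normal(size=time.size)
z = rng.normal(size=time.size)
y = 2 * x + z + rng.normal(size=time.size)
df = pd.DataFrame({"x": x, "z": z, "y": y}, index=pd.to_datetime(time, unit="s"))

coefs = rolling_ols(df)
print(coefs.tail())
```

The result has one row per timestamp and one column per regressor, which is the shape I would hope to get from a rolling apply.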