I am trying to predict multiple independent time series simultaneously using the sklearn linear regression model, but I can't seem to get it right.
My data is organised as follows: Xn is a matrix where each row contains a forecast window of 4 observations, and yn holds the target value for each row of Xn.
import numpy as np
# training data
X1=np.array([[-0.31994,-0.32648,-0.33264,-0.33844],[-0.32648,-0.33264,-0.33844,-0.34393],[-0.33264,-0.33844,-0.34393,-0.34913],[-0.33844,-0.34393,-0.34913,-0.35406],[-0.34393,-0.34913,-.35406,-0.35873],[-0.34913,-0.35406,-0.35873,-0.36318],[-0.35406,-0.35873,-0.36318,-0.36741],[-0.35873,-0.36318,-0.36741,-0.37144],[-0.36318,-0.36741,-0.37144,-0.37529],[-0.36741,-.37144,-0.37529,-0.37896],[-0.37144,-0.37529,-0.37896,-0.38069],[-0.37529,-0.37896,-0.38069,-0.38214],[-0.37896,-0.38069,-0.38214,-0.38349],[-0.38069,-0.38214,-0.38349,-0.38475],[-.38214,-0.38349,-0.38475,-0.38593],[-0.38349,-0.38475,-0.38593,-0.38887]])
X2=np.array([[-0.39265,-0.3929,-0.39326,-0.39361],[-0.3929,-0.39326,-0.39361,-0.3931],[-0.39326,-0.39361,-0.3931,-0.39265],[-0.39361,-0.3931,-0.39265,-0.39226],[-0.3931,-0.39265,-0.39226,-0.39193],[-0.39265,-0.39226,-0.39193,-0.39165],[-0.39226,-0.39193,-0.39165,-0.39143],[-0.39193,-0.39165,-0.39143,-0.39127],[-0.39165,-0.39143,-0.39127,-0.39116],[-0.39143,-0.39127,-0.39116,-0.39051],[-0.39127,-0.39116,-0.39051,-0.3893],[-0.39116,-0.39051,-0.3893,-0.39163],[-0.39051,-0.3893,-0.39163,-0.39407],[-0.3893,-0.39163,-0.39407,-0.39662],[-0.39163,-0.39407,-0.39662,-0.39929],[-0.39407,-0.39662,-0.39929,-0.4021]])
# target values
y1=np.array([-0.34393,-0.34913,-0.35406,-0.35873,-0.36318,-0.36741,-0.37144,-0.37529,-0.37896,-0.38069,-0.38214,-0.38349,-0.38475,-0.38593,-0.38887,-0.39184])
y2=np.array([-0.3931,-0.39265,-0.39226,-0.39193,-0.39165,-0.39143,-0.39127,-0.39116,-0.39051,-0.3893,-0.39163,-0.39407,-0.39662,-0.39929,-0.4021,-0.40506])
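For context, each yn value appears to be the observation immediately following its 4-value window, i.e. the rows are plain 1-step sliding windows over the raw series. A minimal sketch of how such a matrix could be built from a raw series (using a hypothetical make_windows helper) would be:
import numpy as np

def make_windows(series, width=4):
    # hypothetical helper: stack lagged windows of the series and pair each
    # window with the observation that follows it (1-step-ahead target)
    series = np.asarray(series)
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = series[width:]
    return X, y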
The normal procedure for a single time series, which works as expected, is as follows:
from sklearn.linear_model import LinearRegression
# train the 1st half, predict the 2nd half
half = len(y1) // 2  # or y2, as they have the same length
LR = LinearRegression()
LR.fit(X1[:half], y1[:half])
pred = LR.predict(X1[half:])
r_2 = LR.score(X1[half:],y1[half:])
But how do I apply the linear regression model to multiple independent time series at the same time? I tried the following:
y_stack = np.vstack((y1[None],y2[None]))
X_stack = np.vstack((X1[None],X2[None]))
print 'y1 shape:',y1.shape, 'X1 shape:',X1.shape
print 'y_stack shape:',y_stack.shape, 'X_stack:',X_stack.shape
y1 shape: (16,) X1 shape: (16, 4)
y_stack shape: (2, 16) X_stack: (2, 16, 4)
But then fitting the linear model fails as follows:
LR.fit(X_stack[:, :half], y_stack[:, :half])
It states that the number of dimensions is higher than expected:
C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    394         if not allow_nd and array.ndim >= 3:
    395             raise ValueError("Found array with dim %d. %s expected <= 2."
--> 396                              % (array.ndim, estimator_name))
    397         if force_all_finite:
    398             _assert_all_finite(array)

ValueError: Found array with dim 3. Estimator expected <= 2.
Any advice or hints are much appreciated.
UPDATE
I could make use of a for loop, but as n is in reality on the order of 10000 or more, I was hoping to find a solution that uses array operations, as these are the explicit strength of numpy, scipy and, hopefully, sklearn.
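For reference, the loop I have in mind would look like the sketch below (one independent LinearRegression per series, reusing X_stack and y_stack from above); it works, but it scales linearly with the number of series instead of exploiting vectorised array operations:
from sklearn.linear_model import LinearRegression

models, scores = [], []
for Xi, yi in zip(X_stack, y_stack):  # iterate over the series axis
    half = len(yi) // 2
    lr = LinearRegression()
    lr.fit(Xi[:half], yi[:half])                    # train on the 1st half
    models.append(lr)
    scores.append(lr.score(Xi[half:], yi[half:]))   # R^2 on the 2nd half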