
I am trying to predict multiple independent time series simultaneously using the sklearn LinearRegression model, but I can't seem to get it right.

My data is organised as follows: Xn is a matrix in which each row contains a forecast window of 4 observations, and yn contains the target value for each row of Xn.

import numpy as np

# training data
X1=np.array([[-0.31994,-0.32648,-0.33264,-0.33844],[-0.32648,-0.33264,-0.33844,-0.34393],[-0.33264,-0.33844,-0.34393,-0.34913],[-0.33844,-0.34393,-0.34913,-0.35406],[-0.34393,-0.34913,-0.35406,-0.35873],[-0.34913,-0.35406,-0.35873,-0.36318],[-0.35406,-0.35873,-0.36318,-0.36741],[-0.35873,-0.36318,-0.36741,-0.37144],[-0.36318,-0.36741,-0.37144,-0.37529],[-0.36741,-0.37144,-0.37529,-0.37896],[-0.37144,-0.37529,-0.37896,-0.38069],[-0.37529,-0.37896,-0.38069,-0.38214],[-0.37896,-0.38069,-0.38214,-0.38349],[-0.38069,-0.38214,-0.38349,-0.38475],[-0.38214,-0.38349,-0.38475,-0.38593],[-0.38349,-0.38475,-0.38593,-0.38887]])
X2=np.array([[-0.39265,-0.3929,-0.39326,-0.39361],[-0.3929,-0.39326,-0.39361,-0.3931],[-0.39326,-0.39361,-0.3931,-0.39265],[-0.39361,-0.3931,-0.39265,-0.39226],[-0.3931,-0.39265,-0.39226,-0.39193],[-0.39265,-0.39226,-0.39193,-0.39165],[-0.39226,-0.39193,-0.39165,-0.39143],[-0.39193,-0.39165,-0.39143,-0.39127],[-0.39165,-0.39143,-0.39127,-0.39116],[-0.39143,-0.39127,-0.39116,-0.39051],[-0.39127,-0.39116,-0.39051,-0.3893],[-0.39116,-0.39051,-0.3893,-0.39163],[-0.39051,-0.3893,-0.39163,-0.39407],[-0.3893,-0.39163,-0.39407,-0.39662],[-0.39163,-0.39407,-0.39662,-0.39929],[-0.39407,-0.39662,-0.39929,-0.4021]])

# target values
y1=np.array([-0.34393,-0.34913,-0.35406,-0.35873,-0.36318,-0.36741,-0.37144,-0.37529,-0.37896,-0.38069,-0.38214,-0.38349,-0.38475,-0.38593,-0.38887,-0.39184])
y2=np.array([-0.3931,-0.39265,-0.39226,-0.39193,-0.39165,-0.39143,-0.39127,-0.39116,-0.39051,-0.3893,-0.39163,-0.39407,-0.39662,-0.39929,-0.4021,-0.40506])

The normal procedure for a single time series, which works as expected, is as follows:

from sklearn.linear_model import LinearRegression

# train the 1st half, predict the 2nd half
half = len(y1) // 2 # or y2, as they have the same length
LR = LinearRegression()
LR.fit(X1[:half], y1[:half])
pred = LR.predict(X1[half:])
r_2 = LR.score(X1[half:],y1[half:])

But how do I apply the linear regression model to multiple independent time series at the same time? I tried the following:

y_stack = np.vstack((y1[None],y2[None]))
X_stack = np.vstack((X1[None],X2[None]))

print 'y1 shape:',y1.shape, 'X1 shape:',X1.shape
print 'y_stack shape:',y_stack.shape, 'X_stack:',X_stack.shape

y1 shape: (16,) X1 shape: (16, 4)
y_stack shape: (2, 16) X_stack: (2, 16, 4)

But then fitting the linear model fails:

LR.fit(X_stack[:,:half],y_stack[:,:half])

It states that the number of dimensions is higher than expected (sklearn estimators require X to be at most 2-D, of shape (n_samples, n_features)):

C:\Python27\lib\site-packages\sklearn\utils\validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
        394         if not allow_nd and array.ndim >= 3:
        395             raise ValueError("Found array with dim %d. %s expected <= 2."
    --> 396                              % (array.ndim, estimator_name))
        397         if force_all_finite:
        398             _assert_all_finite(array)

    ValueError: Found array with dim 3. Estimator expected <= 2.

Any advice or hints are much appreciated.

UPDATE

I could make use of a for loop, but as n is in reality on the order of 10,000 or more, I was hoping to find a solution that uses array operations, since these are the explicit strength of numpy, scipy and, hopefully, sklearn.
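For reference, the loop-based approach I would like to avoid looks roughly like this (a minimal sketch, reusing half from the code above):

from sklearn.linear_model import LinearRegression

# baseline: one model per series in a Python loop (slow when n ~ 10000)
results = []
for X, y in ((X1, y1), (X2, y2)):
    LR = LinearRegression()
    LR.fit(X[:half], y[:half])             # train on the 1st half
    results.append((LR.predict(X[half:]),  # predict the 2nd half
                    LR.score(X[half:], y[half:])))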

Mattijn
  • Why can't you treat your data as just one set of dependent and independent variables? – Riyaz Dec 23 '15 at 13:40
  • @Riyaz My data, i.e. the individual time series, are not related to each other; I expect they would have to be related to treat them as one set? – Mattijn Dec 23 '15 at 13:44
  • Yes, indeed regression works on correlation. Then can't you afford to build two separate models? – Riyaz Dec 23 '15 at 13:45
  • @Riyaz In this post I use two as an example, but in reality `n` can be in the range of 10000 or more – Mattijn Dec 23 '15 at 13:47
  • Well, that is the problem. I am not sure if you can jointly model data which are not correlated. To me the only option seems to be building separate models, but that may not be feasible in your case. – Riyaz Dec 23 '15 at 13:53
  • would looping cause an issue here? – jfish003 Dec 23 '15 at 14:02
  • @jfish003, that's what I am suggesting implicitly. – Riyaz Dec 23 '15 at 14:03
  • @jfish003, I could make use of a for loop, but the power of numpy and scipy is explicitly their capability for array operations. Hence my question. – Mattijn Dec 23 '15 at 14:10
  • 2
    Possible duplicate of http://stackoverflow.com/q/30442377/1461210 – ali_m Dec 23 '15 at 17:44

2 Answers


@ali_m I don't think this is a duplicate question, but the two are partly related. And it is indeed possible to fit and predict multiple time series simultaneously using a linear regression model similar to sklearn's:

I created a new class LinearRegression_Multi:

class LinearRegression_Multi:
    def stacked_lstsq(self, L, b, rcond=1e-10):
        """
        Solve L x = b, via SVD least squares, cutting off small singular values.
        L is an array of shape (..., M, N) and b of shape (..., M).
        Returns x of shape (..., N).
        """
        u, s, v = np.linalg.svd(L, full_matrices=False)
        s_max = s.max(axis=-1, keepdims=True)
        s_min = rcond * s_max
        # invert the singular values, zeroing those below the cut-off
        inv_s = np.zeros_like(s)
        inv_s[s >= s_min] = 1 / s[s >= s_min]
        # x = V * diag(1/s) * U^T * b, batched over all leading dimensions
        x = np.einsum('...ji,...j->...i', v,
                      inv_s * np.einsum('...ji,...j->...i', u, b.conj()))
        return np.conj(x, x)

    def center_data(self, X, y):
        """ Centers data to have mean zero along the sample axis (axis 1).
        """
        # center X
        X_mean = np.average(X, axis=1)
        X_std = np.ones(X.shape[0::2])  # no scaling applied; kept for set_intercept
        X = X - X_mean[:, None, :]
        # center y
        y_mean = np.average(y, axis=1)
        y = y - y_mean[:, None]
        return X, y, X_mean, y_mean, X_std

    def set_intercept(self, X_mean, y_mean, X_std):
        """ Calculate the intercept_
        """
        self.coef_ = self.coef_ / X_std # not really necessary
        self.intercept_ = y_mean - np.einsum('ij,ij->i',X_mean,self.coef_)

    def scores(self, y_pred, y_true):
        """
        The coefficient R^2 is defined as (1 - u/v), where u is the residual
        sum of squares ((y_true - y_pred) ** 2).sum() and v is the total
        sum of squares ((y_true - y_true.mean()) ** 2).sum(), per series.
        """
        u = ((y_true - y_pred) ** 2).sum(axis=-1)
        v = ((y_true - y_true.mean(axis=-1)[None].T) ** 2).sum(axis=-1)
        r_2 = 1 - u/v
        return r_2

    def fit(self,X, y):
        """ Fit linear model.        
        """        
        # get coefficients by applying linear regression on stack
        X_, y, X_mean, y_mean, X_std = self.center_data(X, y)
        self.coef_ = self.stacked_lstsq(X_, y)
        self.set_intercept(X_mean, y_mean, X_std)

    def predict(self, X):
        """Predict using the linear model.
        """
        # batched dot product over the series axis:
        # (n_series, n_samples, n_features) . (n_series, n_features)
        return np.einsum('ijk,ik->ij', X, self.coef_) + self.intercept_[None].T
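As a quick sanity check (not part of the class itself, and the random test data here are made up), stacked_lstsq can be compared against np.linalg.lstsq applied slice by slice:

rng = np.random.RandomState(0)
L = rng.randn(5, 8, 3)  # 5 stacked systems, each with 8 samples and 3 unknowns
b = rng.randn(5, 8)
x_stack = LinearRegression_Multi().stacked_lstsq(L, b)
x_loop = np.array([np.linalg.lstsq(Li, bi)[0] for Li, bi in zip(L, b)])
print np.allclose(x_stack, x_loop)  # True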

It can be applied as follows, using the same variables declared in the question:

LR_Multi = LinearRegression_Multi()
LR_Multi.fit(X_stack[:,:half], y_stack[:,:half])
y_stack_pred = LR_Multi.predict(X_stack[:,half:])
R2 = LR_Multi.scores(y_stack_pred, y_stack[:,half:])

The R^2 values for the two time series are:

array([ 0.91262442,  0.67247516])

These indeed match the results of the standard sklearn linear regression applied to each series separately:

from sklearn.linear_model import LinearRegression

LR = LinearRegression()
LR.fit(X1[:half], y1[:half])
R2_1 = LR.score(X1[half:],y1[half:])

LR.fit(X2[:half], y2[:half])
R2_2 = LR.score(X2[half:],y2[half:])
print R2_1, R2_2

0.912624422097 0.67247516054
Mattijn

If you need to build separate models, there is no way to use the power of numpy to gain performance, because you are dealing with many independent tasks. The only thing you can do is run them simultaneously in different threads (using the multiple cores of your CPU), or even split the calculations across different computers.

If you believe all the data fit the same model, then the obvious solution is simply to merge all the Xn and yn and train on the result. This will definitely be faster than calculating separate models.
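A minimal sketch of that merged approach, using the arrays from the question (how the data are combined here is my assumption):

from sklearn.linear_model import LinearRegression

# merge all series into one 2-D design matrix and one target vector
X_all = np.vstack((X1, X2))       # shape (32, 4): all forecast windows
y_all = np.concatenate((y1, y2))  # shape (32,): all targets
LR = LinearRegression()
LR.fit(X_all, y_all)              # one shared model for all series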

But in fact the question is not about computational performance; it is about the result you want to get. If you need different models, you have no option but to calculate them separately. If you need one model, just merge the data. Otherwise, if you calculate separate models, you will face a new problem: how to combine the final parameters of all those models.

Leonid Mednikov