
I would like to predict multiple dependent variables using multiple predictors. If I understood correctly, in principle one could make a bunch of linear regression models that each predict one dependent variable, but if the dependent variables are correlated, it makes more sense to use multivariate regression. I would like to do the latter, but I'm not sure how.

So far I haven't found a Python package that specifically supports this. I've tried scikit-learn, and even though their linear regression model example only shows the case where y is an array (one dependent variable per observation), it seems to be able to handle multiple y. But when I compare the output of this "multivariate" method to the results I get by manually looping over each dependent variable and predicting them independently from each other, the outcome is exactly the same. I don't think this should be the case, because there is a strong correlation between some of the dependent variables (>0.5).

The code just looks like this, with y either an n x 1 or an n x m matrix, and x and newx matrices of various sizes (the number of rows in x is n).

from sklearn import linear_model

ols = linear_model.LinearRegression()
ols.fit(x, y)          # works for y of shape (n, 1) or (n, m)
ols.predict(newx)

Does this function actually perform multivariate regression?


2 Answers


If you want to take the correlation between the dependent variables into account, you probably need Partial Least Squares regression. This method searches for a projection of the independent variables and a projection of the dependent variables such that the covariance between these two projections is maximized. See the scikit-learn implementation here.
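For illustration, here is a minimal sketch of how this could look with scikit-learn's PLSRegression; the variables x, y, and newx follow the question, and n_components=2 is an arbitrary assumption that would normally be chosen by cross-validation:

from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)   # number of latent components is a free parameter
pls.fit(x, y)                         # x: (n, p) predictors, y: (n, m) dependent variables
pred = pls.predict(newx)              # predictions for all m dependent variables at once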


This is a mathematical/stats question, but I will try to answer it here anyway.

The outcome you see is absolutely expected. A linear model like this won't take correlation between dependent variables into account.

If you had only one dependent variable, your model would essentially consist of a weight vector

w_0  w_1  ...  w_n,

where n is the number of features. With m dependent variables, you instead have a weight matrix

w_10  w_11  ...  w_1n
w_20  w_21  ...  w_2n
 ...   ...  ...   ...
w_m0  w_m1  ...  w_mn

But the weights for the different output variables (1, ..., m) are completely independent of each other. Since the total sum of squared errors splits into a sum of squared errors over each output variable, minimizing the total squared loss gives exactly the same result as setting up one univariate linear model per output variable and minimizing their squared losses independently of each other.
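For concreteness, here is a small sketch with made-up data (the shapes and random values are assumptions, not the asker's data) that checks this equivalence numerically:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n, p, m = 100, 3, 2
X = rng.randn(n, p)
Y = X @ rng.randn(p, m) + 0.1 * rng.randn(n, m)   # m correlated dependent variables

# one joint fit on all outputs at once
joint = LinearRegression().fit(X, Y)

# one separate model per output column
per_column = np.vstack([LinearRegression().fit(X, Y[:, j]).coef_ for j in range(m)])

print(np.allclose(joint.coef_, per_column))   # True: the weights are identical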

  • Thanks for the explanation! I misunderstood how this would work with multiple dependent variables. I'll compare this result with [lanenok's answer](http://stackoverflow.com/a/30466795/3565255) – CSquare May 27 '15 at 11:29