
I am starting to use a MultiOutputRegressor in scikit-learn for a multi-variable target that I am trying to estimate with Random Forests.

I had started implementing this manually before I came across MultiOutputRegressor: I was rotating the outputs across single-output regressors so that only one target was predicted at a time, with the remaining target variables used as additional inputs - but it was becoming computationally expensive.

I have searched and reviewed some code, but I am struggling to determine whether the target outputs (y) are used as input features (X). Specifically:

  • when y_1 is being predicted, are y_2 ... y_n used as input features?
  • when y_x is being predicted, are y_1 ... y_n (excluding y_x) used as input features?
  • when y_n is being predicted, are y_1 ... y_n-1 used as input features? (apologies if I'm being overly verbose)

The paper "Multi-target regression via input space expansion" explains what I am looking to achieve.

Some answers have alluded to MultiOutputRegressor looking for correlations between the target values, but I'm hoping the targets are actually rotated into inputs (or effective inputs) for the algorithm in my application.


2 Answers


Looking at the source code, at def fit(self, X, Y, **fit_params), it seems to separate the responses and fit them individually. Also, since it can wrap pretty much any regressor / classifier, I can't imagine it estimating the relationship between the outputs in a way that would work across so many different models.

If you are looking for something that considers the relationship between the response variables, you can check out this post, which uses a Gaussian process.

Below is an example using linear regression to show that the coefficients are identical whether you fit through MultiOutputRegressor or iterate through each output individually:

import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.linear_model import LinearRegression

np.random.seed(111)

# two correlated targets drawn from a multivariate normal
mean = [0, 2]
cov = [[1, 0.3], [0.3, 3]]

y = np.random.multivariate_normal(mean, cov, 100)
X = np.random.normal(0, 1, (100, 2))

# fit both targets at once with MultiOutputRegressor
regr_multi = MultiOutputRegressor(LinearRegression())
regr_multi.fit(X, y)

# fit each target separately with its own LinearRegression
regr_list = [LinearRegression().fit(X, y[:, i]) for i in range(y.shape[1])]

print(regr_multi.estimators_[0].coef_, regr_list[0].coef_)
# [-0.04355358 -0.03379101] [-0.04355358 -0.03379101]

print(regr_multi.estimators_[1].coef_, regr_list[1].coef_)
# [ 0.2921806 -0.1372799] [ 0.2921806 -0.1372799]
– StupidWolf
  • Great stuff! I am quite unsure... if you could help: I can understand that the "simple method" - `MultiOutputRegressor` - can't capture the relationship between the targets, since it doesn't consider the other targets while training. But the `RegressorChain` should be able to capture that relationship, shouldn't it? As it is being trained on the other targets as well (partially, anyway). Secondly, of the two, which is more likely to result in overfitting? Is the first method more likely to overfit than the second one? Thanks in advance! – Aayush Shah Jul 04 '22 at 11:31

If I understand correctly, the method you describe in your post doesn't seem to be either of the methods in your linked paper.

The first method in the paper, Stacked Single Target, can probably be accomplished with the StackingRegressor, although it would take a bit of hacking so that the base models are passing forward only predictions for their assigned output.
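For illustration, here is a rough manual sketch of that stacked-single-target idea; the random data and RandomForestRegressor are just placeholders standing in for whatever base model you'd actually use:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

X = np.random.normal(size=(100, 5))   # placeholder inputs
Y = np.random.normal(size=(100, 3))   # placeholder targets
n_targets = Y.shape[1]

# stage 1: one model per target; out-of-fold predictions avoid leaking
# the training targets into the stage-2 features
stage1 = [RandomForestRegressor(n_estimators=50).fit(X, Y[:, i]) for i in range(n_targets)]
Y_hat = np.column_stack([
    cross_val_predict(RandomForestRegressor(n_estimators=50), X, Y[:, i], cv=5)
    for i in range(n_targets)
])

# stage 2: refit each target on the original features plus all stage-1 predictions
X_aug = np.hstack([X, Y_hat])
stage2 = [RandomForestRegressor(n_estimators=50).fit(X_aug, Y[:, i]) for i in range(n_targets)]

# at prediction time, the stage-1 models produce Y_hat for new X, which then
# feeds the stage-2 models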

The second method in the paper, Ensemble of Regressor Chains, should be pretty straightforward with the RegressorChain class, with cv set, then ensembled over multiple orders.
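A rough sketch of what I mean, again with placeholder data and RandomForestRegressor as the base estimator:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import RegressorChain

X = np.random.normal(size=(100, 5))   # placeholder inputs
Y = np.random.normal(size=(100, 3))   # placeholder targets

# ten chains, each with a different random target ordering; cv=5 means each link
# in the chain is trained on cross-validated predictions of the earlier targets
chains = [
    RegressorChain(RandomForestRegressor(n_estimators=50),
                   order='random', cv=5, random_state=i).fit(X, Y)
    for i in range(10)
]

# ensemble the chains by averaging their predictions
Y_pred = np.mean([chain.predict(X) for chain in chains], axis=0)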

The method you describe doesn't seem to be available as a sklearn builtin, though it shouldn't be too hard to just loop through and fit individual models?
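Something like the following, say (placeholder data again; note that to predict with these models you would need values, or at least estimates, of the other targets at prediction time):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.normal(size=(100, 5))   # placeholder inputs
Y = np.random.normal(size=(100, 3))   # placeholder targets

models = []
for i in range(Y.shape[1]):
    # target i is the response; the remaining targets are appended as extra features
    X_aug = np.hstack([X, np.delete(Y, i, axis=1)])
    models.append(RandomForestRegressor(n_estimators=50).fit(X_aug, Y[:, i]))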

– Ben Reiniger