
I have cleaned and prepared a data set to be modeled with four regression types: Linear, Lasso, Ridge, and Random Forest.

The problem lies in the Linear Regression model. When running cross-validation with k = 5 I get:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_pred = linreg.predict(X_test)
cv_scores_linreg = cross_val_score(linreg, X_train, y_train, cv=5)

print("R^2: {}".format(linreg.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
print("Mean 5-Fold CV Score: {}".format(np.mean(cv_scores_linreg)))

print(cv_scores_linreg)

Which prints these scores:

R^2: 0.40113615279035175
Root Mean Squared Error: 0.7845007237654832
Mean 5-Fold CV Score: -8.07591739989044e+19
[ 3.70497335e-01 -9.07945703e+19  3.38625853e-01  3.38206306e-01
 -3.13001300e+20]

For my Random Forest I use:

from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor()
rf_reg.fit(X_train, y_train)

y_pred_rf = rf_reg.predict(X_test)
cv_scores_rf = cross_val_score(rf_reg, X_train, y_train, cv=5)

print("R^2: {}".format(rf_reg.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print("Root Mean Squared Error: {}".format(rmse))
print("Mean 5-Fold CV Score: {}".format(np.mean(cv_scores_rf)))

print(cv_scores_rf)

Which gives:

R^2: 0.42158777391603736
Root Mean Squared Error: 0.770988735248686
Mean 5-Fold CV Score: 0.3894909330419569
[0.39982241 0.39516204 0.37037191 0.38400655 0.39809175]

I can't understand why all of my other models give me values similar to the Random Forest; the only outlier is the Linear model. When I increase k to 10, 20, 30, etc., roughly one new huge score value appears for every 10 added to k.

I've removed all null data and empty strings, and have log-transformed my data to normalize it. What can be going wrong when only the Linear model produces issues?
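A minimal way to localize the problem is to pin the folds with a fixed `KFold` so the splits are reproducible, then inspect which fold produces the huge value. This sketch uses synthetic data from `make_regression` since the original data set isn't available:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real data set.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Fixed random_state makes the folds reproducible, so a bad fold can
# be identified and examined on its own.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)

for i, s in enumerate(scores):
    print("fold {}: R^2 = {:.4f}".format(i, s))
```

On well-behaved data every fold should score in a similar range; a single fold with an astronomically negative R^2 points at rows or features that only appear in that fold's test split.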

HelloToEarth
  • Possible duplicate of [scikit-learn cross validation, negative values with mean squared error](https://stackoverflow.com/questions/21443865/scikit-learn-cross-validation-negative-values-with-mean-squared-error) – Gambit1614 Jun 25 '18 at 16:27
  • Possibly, but it's not the sign of the value I'm concerned with it's the exponential to the 19 and 20 power. I have no idea where this is coming from. – HelloToEarth Jun 25 '18 at 16:48
  • You should try reproducing the example in cross_val_score (by using a reproducible `cv` there to check scores on test fold each time) or post the data that duplicates the behaviour here. Without the actual data, we cannot help – Vivek Kumar Jun 26 '18 at 04:29
  • Yes, as @VivekKumar says, without data this is not easy to tell. Check the [assumptions behind the linear model](http://r-statistics.co/Assumptions-of-Linear-Regression.html) and check if your data fulfills them. Most importantly, check if you have highly correlated variables. If so, drop one of them or try using Ridge Regression, which is a penalized Linear Regression that can handle collinearity. – Marcus V. Jun 26 '18 at 06:22
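The collinearity check suggested in the comments can be sketched as follows; `correlated_pairs` is a hypothetical helper, not part of scikit-learn:

```python
import numpy as np

def correlated_pairs(X, threshold=0.95):
    """Return index pairs (i, j) of columns whose |correlation| exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(corr[i, j]) > threshold]

# Example: the appended column is an exact copy of column 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.column_stack([X, X[:, 0]])
print(correlated_pairs(X))  # the duplicated pair is flagged
```

Any pair this flags is a candidate for dropping one feature, or for switching to a penalized model such as Ridge.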

1 Answer


I have faced the same problem and solved it by using Ridge regression instead of plain linear regression.
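A minimal sketch of the swap, on synthetic data (the original data set isn't shared). Ridge adds an L2 penalty that keeps coefficients bounded even when features are highly correlated, which is what tends to stabilize the per-fold scores:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data set.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# alpha controls the strength of the L2 penalty; alpha=1.0 is the default.
ridge = Ridge(alpha=1.0)
cv_scores_ridge = cross_val_score(ridge, X, y, cv=5)

print("Mean 5-Fold CV Score: {}".format(np.mean(cv_scores_ridge)))
print(cv_scores_ridge)
```

If Ridge scores sensibly while plain `LinearRegression` blows up on the same folds, that is strong evidence of near-collinear features rather than a bug in the CV setup.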