I have cleaned and prepared a data set to be modeled with four different regression models: Linear, Lasso, Ridge, and Random Forest.
The problem lies in the Linear Regression model. When running 5-fold CV (k = 5) I get:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
cv_scores_linreg = cross_val_score(linreg, X_train, y_train, cv=5)

print("R^2: {}".format(linreg.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
print("Mean 5-Fold CV Score: {}".format(np.mean(cv_scores_linreg)))
print(cv_scores_linreg)
This prints:
R^2: 0.40113615279035175
Root Mean Squared Error: 0.7845007237654832
Mean 5-Fold CV Score: -8.07591739989044e+19
[ 3.70497335e-01 -9.07945703e+19 3.38625853e-01 3.38206306e-01
-3.13001300e+20]
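A minimal way to rule out row ordering would be to shuffle the folds instead of letting KFold take contiguous blocks. Sketched here on synthetic stand-in data (make_regression and the random_state values are placeholders for my actual set):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data; my real X_train / y_train would go here instead.
X_train, y_train = make_regression(n_samples=200, n_features=5,
                                   noise=0.5, random_state=0)

# Shuffle rows into folds rather than splitting on contiguous blocks.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=cv)
print(scores)  # all five folds should be finite, ordinary R^2 values
```

If the exploding folds vanish with shuffling, the problem is tied to how the rows are ordered, not to the model itself.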
For my Random Forest I use:
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor()
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
cv_scores_rf = cross_val_score(rf_reg, X_train, y_train, cv=5)
print("R^2: {}".format(rf_reg.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print("Root Mean Squared Error: {}".format(rmse))
print("Mean 5-Fold CV Score: {}".format(np.mean(cv_scores_rf)))
print(cv_scores_rf)
Which gives:
R^2: 0.42158777391603736
Root Mean Squared Error: 0.770988735248686
Mean 5-Fold CV Score: 0.3894909330419569
[0.39982241 0.39516204 0.37037191 0.38400655 0.39809175]
I can't understand why all of my other models give me values similar to Random Forest; the Linear model is the only outlier. When I increase k to 10, 20, 30, etc., roughly one new huge score value appears for every 10 added to k.
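To pin down which rows land in the bad folds, each fold's score can be paired with its validation indices; a sketch on stand-in data (my real X/y would replace make_regression):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Stand-in data; the real X_train / y_train would replace this.
X, y = make_regression(n_samples=100, n_features=4, noise=1.0, random_state=0)

fold_scores = []
for i, (tr, va) in enumerate(KFold(n_splits=5).split(X)):
    r2 = LinearRegression().fit(X[tr], y[tr]).score(X[va], y[va])
    fold_scores.append(r2)
    print(f"fold {i}: rows {va[0]}-{va[-1]}, R^2 = {r2:.3f}")
```

On my data, the rows that fall into the exploding folds would be the ones to inspect by hand.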
I've removed all null values and empty strings, and log-transformed the data to normalize it. What could be going wrong when only the Linear model is producing issues?
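Since only the unregularized linear model blows up, one suspect would be (near-)collinear columns, which ordinary least squares is sensitive to but Lasso, Ridge, and forests largely are not. The condition number of the design matrix is a quick check (toy matrix below; my real X_train would go in its place):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)
# Toy design matrix whose second column is almost an exact multiple
# of the first; a real X_train would be checked the same way.
X = np.column_stack([a, 2 * a + 1e-9 * rng.normal(size=100),
                     rng.normal(size=100)])
cond = np.linalg.cond(X)
print(cond)  # enormous value => near-singular design, unstable OLS fits
```

A condition number in the billions or higher would mean some folds can produce wildly unstable coefficients, matching the occasional 1e+19-scale scores.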