I'm running RandomForestRegressor() and using R-squared for scoring. Why do I get dramatically different results from .score versus cross_val_score? Here is the relevant code:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score

X = df.drop(['y_var'], axis=1)
y = df['y_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Random Forest Regression
rfr = RandomForestRegressor()
model_rfr = rfr.fit(X_train, y_train)
pred_rfr = rfr.predict(X_test)
result_rfr = model_rfr.score(X_test, y_test)  # R-squared on the held-out test set

# cross-validation (5-fold R-squared on the full data)
rfr_cv_r2 = cross_val_score(rfr, X, y, cv=5, scoring='r2')
I understand that cross-validation scores the model multiple times while .score scores it once, but the results are so radically different that something is clearly wrong. Here are the results:
R2-dot-score: 0.99072
R2-cross-val: [0.5349302 0.65832268 0.52918704 0.74957719 0.45649582]
What am I doing wrong? Or what might explain this discrepancy?
EDIT:
OK, I may have solved this. It seems that cross_val_score does not shuffle the data by default, which can lead to much worse fold scores when similar rows are grouped together. The easiest solution I found (via this answer) was to shuffle the dataframe (and rebuild X and y from it) before running the model:
shuffled_df = df.reindex(np.random.permutation(df.index))
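An equivalent fix, if you don't want to reorder the dataframe itself, is to pass a shuffling KFold splitter to cross_val_score. A minimal sketch on synthetic data (my df isn't shown, so make_regression stands in for it):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real features/target.
X_demo, y_demo = make_regression(n_samples=400, n_features=5,
                                 noise=1.0, random_state=0)

# shuffle=True randomizes the row order before splitting into folds,
# replacing the manual np.random.permutation step.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(random_state=0),
                         X_demo, y_demo, cv=cv, scoring='r2')
print(scores)
```

This also keeps the shuffle reproducible via random_state, which the permutation approach doesn't unless you seed NumPy yourself.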
After I did that, I started getting similar results from .score and cross_val_score:
R2-dot-score: 0.9910715555903232
R2-cross-val: [0.99265184 0.9923142 0.9922923 0.99259524 0.99195022]
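For anyone wanting to reproduce the effect without my data, here is a sketch on synthetic data sorted by the target (an assumption about how my rows were grouped): the default unshuffled folds collapse, while shuffled folds recover.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=600, n_features=5, noise=5.0, random_state=0)
order = np.argsort(y)               # simulate data stored in a systematic order
X_sorted, y_sorted = X[order], y[order]

rfr = RandomForestRegressor(random_state=0)

# Default cv=5 uses KFold without shuffling: each fold tests on a range of y
# the model never saw in training, so the fold R-squared scores drop sharply.
unshuffled = cross_val_score(rfr, X_sorted, y_sorted, cv=5, scoring='r2')

# Shuffling the folds restores scores comparable to .score on a random split.
shuffled = cross_val_score(rfr, X_sorted, y_sorted,
                           cv=KFold(n_splits=5, shuffle=True, random_state=0),
                           scoring='r2')
print(unshuffled.mean(), shuffled.mean())
```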