I was trying to understand sklearn's GridSearchCV. I have a few basic questions about the use of cross-validation in GridSearchCV, and about how I should use GridSearchCV's recommendations afterwards.

Say I declare a GridSearchCV instance as below:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

RFReg = RandomForestRegressor(random_state = 1)
param_grid = {
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [4, 5, 6, 7, 8, 9, 10]
}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=10)
CV_rfc.fit(X_train, y_train)
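If I am counting right, this grid has 4 × 7 = 28 parameter combinations, so with cv=10 GridSearchCV would fit 280 models in total. A minimal sketch to check that count, using sklearn's ParameterGrid with the same param_grid as above:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above: 4 values of n_estimators x 7 values of max_depth
param_grid = {
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [4, 5, 6, 7, 8, 9, 10]
}

combos = list(ParameterGrid(param_grid))
print(len(combos))        # 28 parameter combinations
print(len(combos) * 10)   # 280 fits with cv=10
```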
I had the below questions:

1. Say in the first iteration n_estimators = 100 and max_depth = 4 is selected for model building. Will the score for this model be chosen with the help of 10-fold cross-validation?

a. My understanding of the process is as follows:

 1. X_train and y_train will be split into 10 folds.
 2. The model will be trained on 9 folds and tested on the 1 remaining fold, and its score will be stored in a list, say score_list.
 3. This process will be repeated 9 more times, and each of these 9 scores will be added to score_list, to give 10 scores in all.
 4. Finally, the average of score_list will be taken to give a final_score for the model with parameters n_estimators = 100 and max_depth = 4.

b. The above process will be repeated with all other possible combinations of n_estimators and max_depth, and each time we will get a final_score for that model.

c. The best model will be the one having the highest final_score, and we will get the corresponding best values of 'n_estimators' and 'max_depth' from CV_rfc.best_params_.

Is my understanding of GridSearchCV correct?
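For reference, here is how I would sketch steps 1-4 above by hand for a single parameter combination (synthetic data via make_regression; score_list and final_score are just the names from my description, and model.score uses sklearn's default R² metric):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Synthetic regression data standing in for X_train, y_train
X, y = make_regression(n_samples=200, n_features=5, random_state=1)

# One parameter combination, e.g. n_estimators=100 and max_depth=4
model = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=1)

score_list = []
kf = KFold(n_splits=10, shuffle=True, random_state=1)
for train_idx, test_idx in kf.split(X):
    # Train on 9 folds, score on the 1 held-out fold
    model.fit(X[train_idx], y[train_idx])
    score_list.append(model.score(X[test_idx], y[test_idx]))

final_score = np.mean(score_list)  # the "final_score" for this combination
print(len(score_list), final_score)
```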
2. Now say I get the best model parameters as {'max_depth': 10, 'n_estimators': 100}. I declare an instance of the model as below:

RFReg_best = RandomForestRegressor(n_estimators = 100, max_depth = 10, random_state = 1)

I now have two options, and I wanted to know which of them is correct.
a. Use cross-validation on the entire dataset to see how well the model is performing, as below:

import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(RFReg_best, X, y, cv=10, scoring='neg_mean_squared_error')
rm_score = -scores
rm_score = np.sqrt(rm_score)
b. Fit the model on X_train, y_train and then test it on X_test, y_test:

from sklearn.metrics import mean_squared_error

RFReg_best.fit(X_train, y_train)
y_pred = RFReg_best.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))
Or are both of them correct?