I am trying to implement a random forest regressor using scikit-learn pipelines and nested cross-validation. The dataset is about housing prices, with several features (some numeric, others categorical) and a continuous target variable (median_house_value).
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
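For reference, the snippets below assume roughly these imports (they are omitted from the excerpts, so this is my reconstruction):

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor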
I decided to manually create two stratified 5-fold splits (one for the inner and one for the outer loop of the nested CV). The stratification is based on a binned version of the median_income feature:
df.insert(9, "income_cat",
          pd.cut(df["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                 labels=[1, 2, 3, 4, 5]))
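A quick look at the resulting bins (illustrative only):

# share of samples per income category used for stratification
df["income_cat"].value_counts(normalize=True).sort_index()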
This is the code that creates the splitting folds:
cv1_5 = StratifiedShuffleSplit(n_splits=5, test_size=.2, random_state=42)
cv1_splits = []
# create first 5 stratified folds indices
for train_index, test_index in cv1_5.split(df, df["income_cat"]):
    cv1_splits.append((train_index, test_index))

cv2_5 = StratifiedShuffleSplit(n_splits=5, test_size=.2, random_state=43)
cv2_splits = []
# create second 5 stratified folds indices
for train_index, test_index in cv2_5.split(df, df["income_cat"]):
    cv2_splits.append((train_index, test_index))
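To verify the stratification, each test fold's income_cat proportions can be compared with the full dataset (a quick sanity check, not part of the pipeline):

# every test fold should roughly reproduce the overall income_cat proportions
for _, test_index in cv1_splits:
    print(df["income_cat"].iloc[test_index].value_counts(normalize=True).sort_index())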
# set initial dataset
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"].copy()
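Note that income_cat is still a column of X here; since it is category-dtyped, the ColumnTransformer below drops it anyway (remainder="drop" is the default), but a variant that removes it explicitly (my assumption about the intent) would be:

# hypothetical variant: keep the stratification helper out of the feature set
X = df.drop(["median_house_value", "income_cat"], axis=1)
y = df["median_house_value"].copy()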
This is the preprocessing pipe:
# create preprocess pipe
preprocess_pipe = Pipeline([
    ("ctransformer", ColumnTransformer([
        (
            "num_pipe",
            Pipeline([
                ("imputer", SimpleImputer(strategy="median")),
                ("scaler", StandardScaler())
            ]),
            list(X.select_dtypes(include=[np.number]))
        ),
        (
            "cat_pipe",
            Pipeline([
                ("encoder", OneHotEncoder()),
            ]),
            ["ocean_proximity"]
        )
    ]))
])
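A quick smoke test of the preprocessing step in isolation (illustrative only; the output width depends on how many one-hot categories ocean_proximity produces):

# fit-transform once, outside any CV loop, just to inspect the output shape
Xt = preprocess_pipe.fit_transform(X)
print(Xt.shape)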
And this is the final pipe (which includes the preprocessing one):
pipe = Pipeline([
    ("preprocess", preprocess_pipe),
    ("model", RandomForestRegressor())
])
I am using nested cross-validation to tune the hyperparameters of the final pipe and to compute the generalization error.
Here is the parameter grid:
param_grid = [
    {
        "preprocess__ctransformer__num_pipe__imputer__strategy": ["mean", "median"],
        "model__n_estimators": [3, 10, 30, 50, 100, 150, 300],
        "model__max_features": [2, 4, 6, 8]
    }
]
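As a sanity check, the dunder-separated keys above can be cross-checked against the parameter names the pipeline actually exposes:

# list every tunable parameter name of the composed pipeline
sorted(pipe.get_params().keys())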
This is the final step:
grid_search = GridSearchCV(pipe, param_grid, cv=cv1_splits,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
clf = grid_search.fit(X, y)
generalization_error = cross_val_score(clf.best_estimator_, X=X, y=y, cv=cv2_splits)
generalization_error
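After the fit, the inner-loop winner can be inspected through the standard GridSearchCV attributes:

# hyperparameters selected by the inner loop, and its (negated MSE) score
print(clf.best_params_)
print(clf.best_score_)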
Now, here comes the glitch (the last two lines of the grid-search snippet above):
If I follow the scikit-learn instructions (link), I should write:
generalization_error = cross_val_score(clf, X=X, y=y, cv=cv2_splits, scoring="neg_mean_squared_error")
generalization_error
Unfortunately, calling cross_val_score(clf, X=X, ...) raises an error (the indices are out of bounds for the train/test splits), and the generalization error array contains only NaNs.
On the other hand, if I write it like this:
generalization_error = cross_val_score(clf.best_estimator_, X=X, y=y, cv=cv2_splits, scoring="neg_mean_squared_error")
generalization_error
The script runs flawlessly and I am able to see the generalization error array filled with scores. Can I stick to this last way of doing things, or is there something wrong in the whole process?
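(Aside: for readability, the negated-MSE scores above can be mapped back to RMSE; a minimal sketch:)

# cross_val_score negates the MSE, so flip the sign before the square root
rmse_scores = np.sqrt(-generalization_error)
print(rmse_scores.mean(), rmse_scores.std())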