
I am trying to implement a random forest regressor using scikit-learn Pipelines and nested cross-validation. The dataset is about housing prices, with several features (some numeric, others categorical) and a continuous target variable (median_house_value).

Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 

I decided to manually create two sets of five stratified splits (one for the inner loop, one for the outer loop of the nested CV). The stratification is based on a modified version of the median_income feature:

df.insert(9, "income_cat",
          pd.cut(df["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                 labels=[1, 2, 3, 4, 5]))

This is the code for creating the split indices:

cv1_5 = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
cv1_splits = []

# create the first 5 stratified fold indices
for train_index, test_index in cv1_5.split(df, df["income_cat"]):
    cv1_splits.append((train_index, test_index))

cv2_5 = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=43)
cv2_splits = []

# create the second 5 stratified fold indices
for train_index, test_index in cv2_5.split(df, df["income_cat"]):
    cv2_splits.append((train_index, test_index))
    
# set initial dataset
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"].copy()

This is the preprocessing pipeline:

# create preprocess pipe
preprocess_pipe = Pipeline(
    [
        ("ctransformer", ColumnTransformer([
                ( 
                    "num_pipe", 
                    Pipeline([
                        ("imputer", SimpleImputer(strategy="median")),
                        ("scaler", StandardScaler())
                    ]), 
                    list(X.select_dtypes(include=[np.number]))
                ),
                ( 
                    "cat_pipe", 
                    Pipeline([
                        ("encoder", OneHotEncoder()),
                    ]), 
                    ["ocean_proximity"])
            ])
        ),
    ]
)

And this is the final pipeline (which includes the preprocessing one):

pipe = Pipeline([
    ("preprocess", preprocess_pipe),
    ("model", RandomForestRegressor())
])

I am using nested cross-validation to tune the hyperparameters of the final pipeline and to compute the generalization error.

Here is the parameter grid

param_grid = [
    {
        "preprocess__ctransformer__num_pipe__imputer__strategy": ["mean","median"],
        "model__n_estimators": [3, 10, 30, 50, 100, 150, 300], "model__max_features": [2,4,6,8]
    }
]

This is the final step

grid_search = GridSearchCV(pipe, param_grid, cv=cv1_splits,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)

clf = grid_search.fit(X, y)

generalization_error = cross_val_score(clf.best_estimator_, X=X, y=y, cv=cv2_splits)
generalization_error

Now, here comes the glitch (bottom two lines of preceding code snippet):

If I follow scikit-learn instructions (link), I should write:

generalization_error = cross_val_score(clf, X=X, y=y, cv=cv2_splits, scoring="neg_mean_squared_error")
generalization_error

Unfortunately, calling cross_val_score(clf, X=X, ...) raises an error (the indices are out of bounds for the train/test splits), and the generalization error array contains only NaNs.
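A minimal sketch (with toy arrays standing in for the housing data; all names are hypothetical) of why the precomputed indices blow up: they are absolute positions in the full X, while each clone of clf is fitted on a smaller subset:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-ins for X and income_cat, just to show the mechanism
X_full = np.arange(100).reshape(-1, 1)
strata = np.repeat([0, 1], 50)

# Precompute split indices from the FULL dataset, as in the question
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = list(splitter.split(X_full, strata))[0]

# The outer loop hands GridSearchCV only an 80-row training subset...
X_inner = X_full[:80]

# ...but the precomputed indices still range over all 100 rows, so some
# of them point past the end of the subset -> IndexError inside GridSearchCV
print(train_idx.max() >= len(X_inner))  # True
```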

On the other hand, if I write like so:

generalization_error = cross_val_score(clf.best_estimator_, X=X, y=y, cv=cv2_splits, scoring="neg_mean_squared_error")
generalization_error

The script runs flawlessly, and I can see the generalization error array filled with scores. Can I stick with the latter approach, or is there something wrong in the whole process?

1 Answer


I think the problem here lies in the use of cv1_splits and cv2_splits rather than cv1_5 and cv2_5 (in particular, it is the use of cv1_splits that causes the issue).

In general, cross_val_score() calls fit() on a clone of the clf estimator; here that clone is a GridSearchCV estimator, which gets fitted on several X_inner_train sets (subsets of X carved out by the outer cv2_splits; see here for the notation). Since cv1_splits was built from the full X, its indices are consistent with X's dimension, but they may not be consistent with the (smaller) X_inner_train dimension, hence the out-of-bounds errors.

Instead, by passing cv1_5 itself to the GridSearchCV estimator, the estimator takes care of splitting each inner training set coherently (see here for reference).
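A sketch of this fix on a tiny synthetic dataset (a hypothetical stand-in for the question's pipe, param_grid, and data). One caveat: a splitter object passed this way stratifies on whatever y it receives during fitting, and the question's y (median_house_value) is continuous, so plain ShuffleSplit is used below; stratifying on income_cat inside the nested loop would need extra plumbing.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline

# Tiny synthetic regression data standing in for the housing set
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=200)

pipe = Pipeline([("model", RandomForestRegressor(random_state=0))])
param_grid = {"model__n_estimators": [5, 10]}

# Pass splitter OBJECTS, not precomputed index lists: each clone of
# grid_search re-splits whatever training subset it is handed, so the
# indices always match that subset's size.
cv_inner = ShuffleSplit(n_splits=3, test_size=0.2, random_state=42)
cv_outer = ShuffleSplit(n_splits=3, test_size=0.2, random_state=43)

grid_search = GridSearchCV(pipe, param_grid, cv=cv_inner,
                           scoring="neg_mean_squared_error")

scores = cross_val_score(grid_search, X=X, y=y, cv=cv_outer,
                         scoring="neg_mean_squared_error")
print(scores)  # no NaNs: every outer fold fits and scores successfully
```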

amiola