How to use predefined split for RandomizedSearchCV

Question

I'm trying to regularize my random forest regressor with RandomizedSearchCV. With RandomizedSearchCV the train and test sets are not explicitly specified, but I need to be able to specify them so I can preprocess them after the split. I found this helpful Q&A and also this, but I still do not know how to do it, since in my case I'm using cross-validation. I already tried appending my train and test sets from the cross-validation, but it does not work. It raises ValueError: could not broadcast input array from shape (1824,9) into shape (1824), which refers to my X_test.

x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'], axis=1))
y = np.array(avo_sales.TotalBags)

kf = KFold(n_splits=10)

for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]

impC = SimpleImputer(strategy='most_frequent')
X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
X_test[:,8] = impC.transform(X_test[:,8].reshape(-1,1)).ravel()

imp = SimpleImputer(strategy='median')
X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
X_test[:,1:8] = imp.transform(X_test[:,1:8])

le = LabelEncoder()
X_train[:,8] = le.fit_transform(X_train[:,8])
X_test[:,8] = le.transform(X_test[:,8])

train_indices = X_train, y_test
test_indices = X_test, y_test
my_test_fold = np.append(train_indices, test_indices)
pds = PredefinedSplit(test_fold=my_test_fold)

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rfr = RandomForestRegressor()
rfr_random = RandomizedSearchCV(estimator = rfr,
                               param_distributions = random_grid,
                               n_iter = 100,
                               cv = pds,  # i'll be filling the cv parameter with the predefined split
                               verbose=2, random_state=42, n_jobs = -1)
rfr_random.fit(X_train, y_train)

First, your `for train_index, test_index in kf.split(x):` loop doesn't make sense, as you'll overwrite the folds on every iteration. Add a print inside the loop to better understand what it is doing. Second, to your question: use `cv = kf` and you'll achieve your goal. Fix the random seed for reproducibility. — Sergey Bushmanov, Mar 17 '20 at 20:48
Hello, thank you for the answer. But if I remove `for train_index, test_index in kf.split(x):` I won't be able to preprocess my train and test sets, which needs to be done after splitting. I need them to be explicitly specified so I can access them for preprocessing. — random student, Mar 18 '20 at 01:29

score 0 · Answer 1 · answered Mar 19 '20 at 15:19

I think your best option is to use a Pipeline plus a ColumnTransformer. Pipelines allow you to specify several steps of computations, including pre-/post-processing, and the column transformer applies different transformations to different columns. In your case, that would be something like:

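from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# make_pipeline / make_column_transformer take their steps as positional arguments, not a list
pipeline = make_pipeline(
    make_column_transformer(
        (SimpleImputer(strategy='median'), list(range(1, 8))),
        # OrdinalEncoder replaces LabelEncoder, which only encodes targets (y)
        (make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1),
        ), [8]),
        remainder='passthrough',  # keep column 0 unchanged instead of dropping it
    ),
    RandomForestRegressor(),
)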

Then you use this model as a normal estimator, with the usual fit and predict API. In particular, you give this to the randomized search:

rfr_random = RandomizedSearchCV(estimator = pipeline, ...)
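
One detail to watch for: once the forest sits inside a pipeline, its hyperparameters have to be addressed through the step name. With make_pipeline the step name is the lowercased class name, so the keys get a `randomforestregressor__` prefix. Below is a minimal sketch of the whole search. The synthetic data, the small grid and the 3-fold KFold are my own assumptions to keep it quick to run, and I use OrdinalEncoder for the categorical column because LabelEncoder is meant for targets rather than feature columns.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the real data (assumption): column 0 is complete,
# columns 1-7 are numeric with gaps, column 8 is categorical with gaps.
rng = np.random.default_rng(0)
n = 60
num = rng.normal(size=(n, 8))
gaps = rng.random(size=(n, 7)) < 0.1
num[:, 1:][gaps] = np.nan
cat = rng.choice(["conventional", "organic"], size=n).astype(object)
cat[rng.random(size=n) < 0.1] = np.nan
X = np.column_stack([num.astype(object), cat])
y = rng.normal(size=n)

pipeline = make_pipeline(
    make_column_transformer(
        (SimpleImputer(strategy="median"), list(range(1, 8))),
        (make_pipeline(
            SimpleImputer(strategy="most_frequent"),
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        ), [8]),
        remainder="passthrough",  # column 0 is passed through untouched
    ),
    RandomForestRegressor(random_state=42),
)

# Keys are "<step name>__<parameter>"; values here are deliberately small.
param_distributions = {
    "randomforestregressor__n_estimators": [20, 50],
    "randomforestregressor__max_depth": [3, 5, None],
    "randomforestregressor__min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    n_iter=4,
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)  # fit on the full data; the search does the splitting
print(search.best_params_)
```

Note that the search is fitted on the full X and y, not on a pre-split X_train, since the imputers and the encoder are refitted inside every fold.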

Now the pre-processing steps are fitted on the training part of each split and then applied to its validation part, before the random forest is fitted.

Treat this as a sketch rather than finished code: the column indices and encoders will likely need adapting to your data, but hopefully you get the idea.
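
As for the literal question about PredefinedSplit: as far as I can tell, test_fold is not supposed to contain the data itself. It expects one integer per sample, giving the index of the fold in which that sample is used for testing, with -1 meaning "never in a test set". That is probably why appending the X and y arrays fails. A minimal sketch on toy data (the array sizes are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, PredefinedSplit

# Toy data (hypothetical): 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)

# Case 1: one fixed train/test split.
# -1 = always in training, 0 = test sample of fold 0.
n_train, n_test = 7, 3
single_split = PredefinedSplit(
    test_fold=np.r_[np.full(n_train, -1), np.zeros(n_test, dtype=int)]
)

# Case 2: express an existing KFold as a PredefinedSplit.
test_fold = np.empty(len(X), dtype=int)
for fold_id, (_, test_index) in enumerate(KFold(n_splits=5).split(X)):
    test_fold[test_index] = fold_id
kfold_split = PredefinedSplit(test_fold=test_fold)

for train_index, test_index in kfold_split.split():
    print("train:", train_index, "test:", test_index)
```

Either object can then be passed as cv=... to RandomizedSearchCV, provided the search is fitted on the full array whose rows are in the same order as test_fold.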
