1

I'm trying to regularize my random forest regressor with RandomizedSearchCV. With RandomizedSearchCV the train and test are not explicitly specified, I need to be able to specified my train test set so i can preprocess them after the split. Then i found this helpful QnA and also this. But i still do not know how to do it since in my case, i'm using cross-validation. I already tried to append my train test set from the cross validation but it does not work. It says ValueError: could not broadcast input array from shape (1824,9) into shape (1824) which refers to my X_test

x = np.array(avo_sales.drop(['TotalBags','Unnamed:0','year','region','Date'],1))
y = np.array(avo_sales.TotalBags)

kf = KFold(n_splits=10)

for train_index, test_index in kf.split(x):
    X_train, X_test, y_train, y_test = x[train_index], x[test_index], y[train_index], y[test_index]

impC = SimpleImputer(strategy='most_frequent')
X_train[:,8] = impC.fit_transform(X_train[:,8].reshape(-1,1)).ravel()
X_test[:,8] = impC.transform(X_test[:,8].reshape(-1,1)).ravel()

imp = SimpleImputer(strategy='median')
X_train[:,1:8] = imp.fit_transform(X_train[:,1:8])
X_test[:,1:8] = imp.transform(X_test[:,1:8])

le = LabelEncoder()
X_train[:,8] = le.fit_transform(X_train[:,8])
X_test[:,8] = le.transform(X_test[:,8])

train_indices = X_train, y_test
test_indices = X_test, y_test
my_test_fold = np.append(train_indices, test_indices)
pds = PredefinedSplit(test_fold=my_test_fold)

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rfr = RandomForestRegressor()
rfr_random = RandomizedSearchCV(estimator = rfr , 
                               param_distributions = random_grid,
                               n_iter = 100,
                               cv = pds, verbose=2, random_state=42, n_jobs = -1) <-- i'll be filling the cv parameter with the predefined split
rfr_random.fit(X_train, y_train)
random student
  • 683
  • 1
  • 15
  • 33
  • 2
    First, your `for train_index, test_index in kf.split(x):` doesn't make sense at all, as you'll overwrite folds during this cycle. Include print into the cycle to understand better what you're doing. Second, to your question, use `cv = kf` and you'll achieve your goal. Fix random seed for reproducibility, – Sergey Bushmanov Mar 17 '20 at 20:48
  • hello, thank you for the answer. But if i remove `for train_index, test_index in kf.split(x):` i cannot be able to preprocess my train test set which needs to be done after splitting it. I need my train test set to be explicitly specified so i can access them to preprocess, – random student Mar 18 '20 at 01:29
  • great question, same issue here – Nathan B Aug 27 '23 at 09:47

1 Answers1

0

I think your best option is to use a Pipeline plus a ColumnTransformer. Pipelines allow you to specify several steps of computations, including pre-/post-processing, and the column transformer applies different transformations to different columns. In your case, that would be something like:

pipeline = make_pipeline([
    make_column_transformer([
        (SimpleImputer(strategy='median'), range(1, 8)),
        (make_pipeline([
            SimpleImputer(strategy='most_frequent'),
            LabelEncoder(),
        ]), 8)
    ]),
    RandomForestRegressor()
])

Then you use this model as a normal estimator, with the usual fit and predict API. In particular, you give this to the randomized search:

rfr_random = RandomizedSearchCV(estimator = pipeline, ...)

Now the pre-processing steps will be applied to each split, before fitting the random forest.

This will certainly not work without further adaptations, but hopefully you get the idea.

BlackBear
  • 22,411
  • 10
  • 48
  • 86