I would like to run a randomized search for optimal XGBoost parameters, where the training set of each CV fold is extended with augmented data. My current approach is to add the augmented data inside the split loop of my cross-validation method, after drawing a random parameter dictionary from ParameterSampler. Like so:
import pandas as pd
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier

cv = RepeatedStratifiedKFold(
    n_splits=5,
    n_repeats=5,
    random_state=101,
)
for params in random_param_samples:
    model = XGBClassifier(**params)
    for i, (train, test) in enumerate(cv.split(X, y)):
        # simplified just to show the idea: stack the augmented rows
        # belonging to the current train fold onto the original train fold
        new_X = pd.concat([X.iloc[train, :], augmented_X.iloc[train, :]])
        new_y = pd.concat([y.iloc[train], augmented_y.iloc[train]])
        model.fit(new_X, new_y)
        # evaluate on the original, unaugmented test fold
        y_pred = model.predict(X.iloc[test, :])
        # collect my scores
This works, but it creates a lot of overhead and boilerplate everywhere.
I was wondering whether there is a smart way to tell RandomizedSearchCV, and basically every sklearn method that accepts a cross-validation argument, to train on augmented data but test on the original data without augmentations. It would be great to be able to use cross_validate and similar functions with augmented data.
Maybe I could pass the original data and the augmented data as one DataFrame and tell the CV method that the augmented rows may only appear in the train indices, while the test indices must come from the original data?
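For illustration, here is a minimal sketch of that idea. It relies on the fact that sklearn's cv parameter also accepts an iterable of (train, test) index arrays, so you can generate the splits on the original rows and then extend each train fold with the positions of its augmented counterparts. The data, the noise-based augmentation, and the LogisticRegression stand-in (used so the sketch runs without xgboost) are all assumptions for the example; any estimator such as XGBClassifier would work the same way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression  # stand-in for XGBClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Toy original data and one augmented copy per original row (hypothetical augmentation).
X = pd.DataFrame(rng.normal(size=(100, 4)))
y = pd.Series(rng.integers(0, 2, size=100))
X_aug = X + rng.normal(scale=0.1, size=X.shape)
y_aug = y.copy()

# Stack originals first, augmentations after, and remember the offset.
X_all = pd.concat([X, X_aug], ignore_index=True)
y_all = pd.concat([y, y_aug], ignore_index=True)
n = len(X)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=101)

def augmented_splits(cv, X, y, n):
    """Yield (train, test) pairs over the stacked data: each train fold
    gains the augmented rows of its originals (offset by n), while each
    test fold keeps only original rows."""
    for train, test in cv.split(X, y):
        yield np.concatenate([train, train + n]), test

# cross_val_score (and RandomizedSearchCV) accept this list of splits as cv=.
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X_all,
    y_all,
    cv=list(augmented_splits(cv, X, y, n)),
)
print(scores.mean())
```

Because the splits are plain index arrays over the combined DataFrame, the same list can be passed to RandomizedSearchCV or cross_validate unchanged, removing the manual loop entirely.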