I have a dataset with categorical and non-categorical values. I applied OneHotEncoder to the categorical values and StandardScaler to the continuous values.
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('Vector Cat', OneHotEncoder(handle_unknown="ignore"), ['A', 'B', 'C']),
        ('StandardScaler', StandardScaler(), ['D', 'E']),
    ],
    remainder='passthrough')  # Default is to drop untransformed columns
Now I want to cross-validate my model, but the question is: should I transform my features, and if so, how?
I mean, I need to transform my data because that's the only way to handle the categorical values.
I know that I should fit_transform my training data and only transform my test data, but how can I manage that in cross-validation?
For now, I did this:
features = transformerVectoriser.fit_transform(features)
clf = RandomForestClassifier()
cv_score = cross_val_score(clf, features, results, cv=5)
print(cv_score)
But I think this is not correct, because fit_transform is applied to both the test fold and the train fold, when it should be fit_transform on the training set and only transform on the test set.
Should I just fit the data, just transform it, or do something else?
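For reference, one way this is commonly handled is to wrap the ColumnTransformer and the classifier in a scikit-learn Pipeline and pass the whole pipeline to cross_val_score, so the transformers are refit on each training fold and only applied to the matching test fold. Below is a minimal sketch, not a definitive solution. The data is synthetic and made up purely so the snippet runs (the real features and results would replace it), and the step names "preprocess" and "model" are arbitrary labels chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical); replace with the real features/results.
rng = np.random.default_rng(0)
n = 100
features = pd.DataFrame({
    "A": rng.choice(["a1", "a2", "a3"], size=n),
    "B": rng.choice(["b1", "b2"], size=n),
    "C": rng.choice(["c1", "c2", "c3", "c4"], size=n),
    "D": rng.normal(size=n),
    "E": rng.normal(size=n),
})
results = rng.integers(0, 2, size=n)

# Same preprocessing as in the question.
transformerVectoriser = ColumnTransformer(
    transformers=[
        ("Vector Cat", OneHotEncoder(handle_unknown="ignore"), ["A", "B", "C"]),
        ("StandardScaler", StandardScaler(), ["D", "E"]),
    ],
    remainder="passthrough",
)

# The pipeline is passed UNFITTED and the features UNTRANSFORMED.
# For every fold, cross_val_score clones the pipeline, fits it (encoder,
# scaler and classifier) on the training part, then only transforms and
# scores the held-out part.
clf = Pipeline(steps=[
    ("preprocess", transformerVectoriser),
    ("model", RandomForestClassifier(random_state=0)),
])

cv_score = cross_val_score(clf, features, results, cv=5)
print(cv_score)
```

With this layout there is no separate fit_transform call on the full dataset, and handle_unknown="ignore" covers categories that appear in a test fold but not in the corresponding training fold.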