
I have a dataset with categorical and non-categorical values. I applied OneHotEncoder to the categorical values and StandardScaler to the continuous values.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown='ignore'), ['A', 'B', 'C']),
                  ('StandardScaler', StandardScaler(), ['D', 'E'])],
    remainder='passthrough')  # default is to drop untransformed columns
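For reference, here is a minimal self-contained run of such a transformer on toy data (the column names match the question, but the values and category counts are made up for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with the same (assumed) column names as in the question
df = pd.DataFrame({
    'A': ['x', 'y', 'x', 'z'],
    'B': ['u', 'u', 'v', 'v'],
    'C': ['p', 'q', 'p', 'q'],
    'D': [1.0, 2.0, 3.0, 4.0],
    'E': [10.0, 20.0, 30.0, 40.0],
})

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown='ignore'), ['A', 'B', 'C']),
                  ('StandardScaler', StandardScaler(), ['D', 'E'])],
    remainder='passthrough')

X = transformerVectoriser.fit_transform(df)
print(X.shape)  # (4, 9): 3 + 2 + 2 one-hot columns plus the 2 scaled columns
```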

Now I want to do cross-validation of my model, but the question is: should I transform my features, and how can I do that? I need to transform my data, because that's the only way to handle categorical values. I know that I should fit_transform my training data and only transform my test data, but how can I manage that in cross-validation?

For now, I did this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

features = transformerVectoriser.fit_transform(features)

clf = RandomForestClassifier()
cv_score = cross_val_score(clf, features, results, cv=5)
print(cv_score)

But I think this is not correct, because this way fit_transform is applied to both the training and the test folds, whereas it should be fit_transform on the training set and only transform on the test set. Should I just fit the data, just transform the data, or something else?

taga
    You should wrap your feature transformations and model in a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) (preferred way); or do it manually as I show [here](https://stackoverflow.com/questions/54201464/cross-validation-metrics-in-scikit-learn-for-each-data-split/54202609#54202609) (keep in mind that `cross_val_score` does not shuffle the data, which can be an issue). – desertnaut Jun 10 '21 at 19:15

1 Answer


desertnaut already teased the answer in his comment; I will just spell it out and complete it:

When you want to cross-validate several data processing steps together with an estimator, the best way is to use Pipeline objects. According to the user guide, a Pipeline serves multiple purposes, one of them being safety:

Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

With your definitions from above, you would wrap your transformations and classifier in a Pipeline the following way:

from sklearn.pipeline import Pipeline


pipeline = Pipeline([
    ('transformer', transformerVectoriser),
    ('classifier', clf)
])

The steps in the pipeline can now be cross-validated together:

cv_score = cross_val_score(pipeline, features, results, cv=5)
print(cv_score)

This will ensure that, in each CV iteration, all transformers and the final estimator in the pipeline are fitted on the training folds only, and that only the transform and predict methods are called on the held-out test fold.
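To make the mechanics concrete, here is a rough, self-contained sketch of what the pipeline does under the hood in each fold, written out manually with `KFold` and `clone` (the toy data and column names are made up for illustration; `cross_val_score` additionally handles scoring and stratification details):

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-ins for the question's `features` / `results`
rng = np.random.RandomState(0)
features = pd.DataFrame({
    'A': rng.choice(['x', 'y', 'z'], size=50),
    'B': rng.choice(['u', 'v'], size=50),
    'C': rng.choice(['p', 'q'], size=50),
    'D': rng.normal(size=50),
    'E': rng.normal(size=50),
})
results = pd.Series(rng.choice([0, 1], size=50))

transformerVectoriser = ColumnTransformer(
    transformers=[('Vector Cat', OneHotEncoder(handle_unknown='ignore'), ['A', 'B', 'C']),
                  ('StandardScaler', StandardScaler(), ['D', 'E'])],
    remainder='passthrough')

scores = []
for train_idx, test_idx in KFold(n_splits=5).split(features):
    transformer = clone(transformerVectoriser)          # fresh, unfitted copy per fold
    model = clone(RandomForestClassifier(random_state=0))

    # fit_transform on the training fold only ...
    X_train = transformer.fit_transform(features.iloc[train_idx])
    # ... and transform (no fitting!) on the test fold
    X_test = transformer.transform(features.iloc[test_idx])

    model.fit(X_train, results.iloc[train_idx])
    scores.append(model.score(X_test, results.iloc[test_idx]))

print(scores)
```

The `Pipeline` + `cross_val_score` combination is the preferred, less error-prone way to express exactly this loop.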

If you want to read up more on the usage of Pipeline, check the documentation.

afsharov
  • So bottom line, train data in CV will be fitted and transformed, and test data in CV will be only transformed, right? Because, I couldn't find info about that in the documentation – taga Jun 10 '21 at 22:40
  • @taga yes, correct. In this section [here](https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics) you can find an example of which behavior is mimicked with a `Pipeline` in the paragraph titled *Data transformation with held out data*. – afsharov Jun 10 '21 at 22:59
  • Good job (didn't have the time to write an answer myself). Additional remark: it's always good practice to shuffle the data before, as `cross_val_score` paradoxically (and in contrast with the rest of the CV functionality in scikit-learn) will not do it (doesn't even contain an option); for examples of how this can hurt or distort the performance, see own answers [here](https://stackoverflow.com/a/54202609/4685471), [here](https://stackoverflow.com/a/61227567/4685471), and [here](https://stackoverflow.com/a/55309222/4685471) cc @taga – desertnaut Jun 11 '21 at 07:23
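Following up on desertnaut's remark about shuffling: a self-contained sketch (again with made-up toy data and illustrative column names) of passing an explicit shuffling `KFold` splitter to `cross_val_score` instead of the bare `cv=5`:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-ins for the question's `features` / `results`
rng = np.random.RandomState(0)
features = pd.DataFrame({
    'A': rng.choice(['x', 'y', 'z'], size=60),
    'B': rng.choice(['u', 'v'], size=60),
    'C': rng.choice(['p', 'q'], size=60),
    'D': rng.normal(size=60),
    'E': rng.normal(size=60),
})
results = pd.Series(rng.choice([0, 1], size=60))

pipeline = Pipeline([
    ('transformer', ColumnTransformer(
        transformers=[('Vector Cat', OneHotEncoder(handle_unknown='ignore'), ['A', 'B', 'C']),
                      ('StandardScaler', StandardScaler(), ['D', 'E'])],
        remainder='passthrough')),
    ('classifier', RandomForestClassifier(random_state=0)),
])

# An integer cv uses a (Stratified)KFold WITHOUT shuffling; pass an
# explicit splitter to randomize how samples are assigned to folds.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_score = cross_val_score(pipeline, features, results, cv=cv)
print(cv_score)
```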