
My train and test data sets are two separate CSV files.

I've done some feature engineering on the test set and have used `pd.get_dummies()`, which works as expected.

Training Classes

|Condition|
|---------|
|Poor|
|Ok|
|Good|
|Excellent|

My issue is that there is a mismatch when I try to predict the values, as the test set has a different number of columns after `pd.get_dummies()`.

Test set:

|Condition|
|---------|
|Poor|
|Ok|
|Good|

Notice that Excellent is missing! And overall, after creating dummies, I'm about 20 columns short of the training dataframe.

My question: is it acceptable to join train.csv and test.csv, run all my feature engineering, scaling, etc., and then split back into the two dataframes before the training phase?

Or is there another better solution?
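The join-then-split idea above can be sketched roughly as follows (a minimal illustration with made-up stand-in frames, not the actual CSVs):

```python
import pandas as pd

# Toy stand-ins for the two CSV files (assumed names/values)
train = pd.DataFrame({"Condition": ["Poor", "Ok", "Good", "Excellent"]})
test = pd.DataFrame({"Condition": ["Poor", "Ok", "Good"]})

# Concatenate with keys so the rows can be split back apart later
combined = pd.concat([train, test], keys=["train", "test"])

# Dummies are created over the union of categories in both frames
dummies = pd.get_dummies(combined)

# Split back into the original two frames; both now share the
# same columns, including Condition_Excellent (all 0 in test)
train_d = dummies.loc["train"]
test_d = dummies.loc["test"]
```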

Lewis Morris
  • Are you performing `pd.get_dummies()` on the test set as well? – David Buck Nov 13 '19 at 20:50
    https://stats.stackexchange.com/questions/174823/how-to-apply-standardization-normalization-to-train-and-testset-if-prediction-i You should not use any information from the test set, so they need to be kept separate (i.e. the test cannot be used in the scaling of train) – ALollz Nov 13 '19 at 20:54
    As an aside there's no solid rule for how to split your data. Given your issues, you seem to be doing a simple X/100-X split, which is fine. However with classes you can ensure representation within groups using StratifiedKFold, for instance. https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#sphx-glr-auto-examples-model-selection-plot-cv-indices-py I think the visualizations at the bottom are helpful – ALollz Nov 13 '19 at 21:01
  • @DavidBuck yes, performing `pd.get_dummies()` on both the train and test sets. – Lewis Morris Nov 13 '19 at 21:06
    [Relevant](https://stackoverflow.com/questions/41335718/keep-same-dummy-variable-in-training-and-testing-data) – G. Anderson Nov 13 '19 at 21:10
  • @G.Anderson thanks, that's perfect for me – Lewis Morris Nov 14 '19 at 05:40

1 Answer


It is acceptable to join the train and test sets as you describe, but I would not recommend it.

In particular, when you deploy a model and start scoring "real data", you don't get the chance to join it back to the train set to produce the dummy variables.

There are alternative solutions using the OneHotEncoder class from Scikit-learn, Feature-engine, or Category Encoders. All of these are open-source Python packages with classes that implement the fit/transform functionality.

With fit, the class learns from the train set which dummy variables to create, and with transform it creates them. In the example you provide, the test set will also have 4 dummies, and the dummy "Excellent" will contain all 0s.
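For example, a minimal sketch with Scikit-learn's OneHotEncoder (the toy frames here stand in for your actual data):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for the train and test sets (assumed values)
train = pd.DataFrame({"Condition": ["Poor", "Ok", "Good", "Excellent"]})
test = pd.DataFrame({"Condition": ["Poor", "Ok", "Good"]})

# Fit learns all 4 categories from the train set only
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Condition"]])

# Transform produces the same 4 dummy columns for the test set;
# the "Excellent" column is simply all zeros there
test_ohe = enc.transform(test[["Condition"]]).toarray()
```

In a deployment setting the fitted encoder is what you persist, so new data is always transformed into the exact column layout the model was trained on.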

Find examples of the OneHotEncoder from Scikit-learn, Feature-engine, and Category Encoders in the provided links.

Sole Galli