My train and test datasets are two separate CSV files.
I've done some feature engineering on the test set and have used pd.get_dummies(), which works as expected.
Training set:
|Condition|
-----------
Poor
Ok
Good
Excelent
My issue is that there is a mismatch when I try to predict, because the test set has a different number of columns after pd.get_dummies().
Test set:
|Condition|
-----------
Poor
Ok
Good
Notice that Excelent is missing! Overall, after creating the dummies, the test dataframe is about 20 columns short of the training dataframe.
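To make the mismatch concrete, here is a minimal sketch that reproduces it. The two small dataframes are made-up stand-ins for my real train.csv and test.csv (which have many more columns), so only the Condition column is shown:

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv and test.csv
train = pd.DataFrame({"Condition": ["Poor", "Ok", "Good", "Excelent"]})
test = pd.DataFrame({"Condition": ["Poor", "Ok", "Good"]})

# Encode each dataframe separately, as I am doing now
train_dummies = pd.get_dummies(train, columns=["Condition"])
test_dummies = pd.get_dummies(test, columns=["Condition"])

# Dummy columns that exist for training but were never created for test
missing_in_test = sorted(set(train_dummies.columns) - set(test_dummies.columns))

print(train_dummies.shape[1], test_dummies.shape[1])
print(missing_in_test)
```

Here the training frame ends up with one more dummy column than the test frame, because pd.get_dummies() only creates columns for the categories it actually sees in the data it is given.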
My question: is it acceptable to join train.csv and test.csv, run all my feature engineering, scaling, etc. on the combined data, and then split it back into the two dataframes before the training phase?
Or is there a better solution?
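For reference, the join-then-split idea I have in mind would look roughly like the sketch below. The toy dataframes and the "train"/"test" keys are assumptions for illustration, not my real data, and only the dummy encoding step is shown (no scaling):

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv and test.csv
train = pd.DataFrame({"Condition": ["Poor", "Ok", "Good", "Excelent"]})
test = pd.DataFrame({"Condition": ["Poor", "Ok", "Good"]})

# Stack both frames; the keys add an outer index level recording the origin
combined = pd.concat([train, test], keys=["train", "test"])

# Encode once, so both parts get the same set of dummy columns
combined = pd.get_dummies(combined, columns=["Condition"])

# Split back into the two dataframes using the outer index level
train_encoded = combined.loc["train"]
test_encoded = combined.loc["test"]

print(list(train_encoded.columns) == list(test_encoded.columns))
```

With this, the test frame gets a Condition_Excelent column too, just with no rows set in it. What I am unsure about is whether running the rest of the pipeline (scaling in particular) on the combined data is acceptable practice.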