How to match test columns with train data?

Question

I get an error when I try to predict with a naive Bayes classifier:

from sklearn.naive_bayes import GaussianNB
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/sjwhitworth/golearn/master/examples/datasets/tennis.csv')

X_train = pd.get_dummies(df[['outlook', 'temp', 'humidity', 'windy']])
y_train = df['play']

gNB = GaussianNB()
gNB.fit(X_train, y_train)

ndf=pd.DataFrame({'outlook':['sunny'], 'temp':['hot'], 'humidity':['normal'], 'windy':[False]})
X_test=pd.get_dummies(ndf[['outlook', 'temp', 'humidity', 'windy']])

gNB.predict(X_test)

ValueError: operands could not be broadcast together with shapes (1,4) (9,)

Is it a good idea to use get_dummies method in this case?

No. `get_dummies` creates only as many columns as there are distinct values in the data it is given, which in most cases will not match the training data. Use `LabelEncoder` + `OneHotEncoder` in this case. Or, if you can use the development version of scikit-learn from GitHub, use the `CategoricalEncoder` it provides. Please look at [my answer here](https://stackoverflow.com/a/48079345/3374996) — Vivek Kumar, Jun 11 '18 at 07:57
Alternatively, if you want to use `get_dummies()`, apply it to the whole dataset before splitting it into train and test. But that is not feasible in real-life use or once the model is deployed to production. — Vivek Kumar, Jun 11 '18 at 07:58
Possible duplicate of [Scikit Learn OneHotEncoder fit and transform Error: ValueError: X has different shape than during fitting](https://stackoverflow.com/questions/48074462/scikit-learn-onehotencoder-fit-and-transform-error-valueerror-x-has-different) — E_net4, Jun 11 '18 at 17:57
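
The encoder-based approach suggested in the comments above could look roughly like the sketch below. It is only an illustration under some assumptions: the small inline DataFrame is a hypothetical stand-in for `tennis.csv` (so the example runs without a download), and `handle_unknown="ignore"` is one possible setting, chosen here so that unseen categories become all-zero columns instead of raising an error.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for tennis.csv (illustrative rows only).
train = pd.DataFrame({
    "outlook": ["sunny", "overcast", "rainy", "rainy", "sunny", "overcast"],
    "temp": ["hot", "hot", "mild", "cool", "mild", "cool"],
    "humidity": ["high", "high", "high", "normal", "normal", "normal"],
    "windy": [False, True, False, True, False, True],
    "play": ["no", "yes", "yes", "no", "yes", "yes"],
})
features = ["outlook", "temp", "humidity", "windy"]

# Fit the encoder on the training data only; it remembers every category
# it saw, so later transforms always yield the same columns.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train = encoder.fit_transform(train[features]).toarray()

model = GaussianNB()
model.fit(X_train, train["play"])

# A single new row is encoded with the already-fitted encoder.
new_row = pd.DataFrame({"outlook": ["sunny"], "temp": ["hot"],
                        "humidity": ["normal"], "windy": [False]})
X_test = encoder.transform(new_row[features]).toarray()

prediction = model.predict(X_test)
print(X_train.shape, X_test.shape, prediction)
```

Because the encoder is fitted once and reused, the test matrix has the same number of columns as the training matrix, whichever values appear in the new row.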

score 1 · Accepted Answer · answered Jun 11 '18 at 18:31

As Vivek pointed out, this is not good practice, but here is the code if you want to do it anyway:

from sklearn.naive_bayes import GaussianNB
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/sjwhitworth/golearn/master/examples/datasets/tennis.csv')

X_train = pd.get_dummies(df[['outlook', 'temp', 'humidity', 'windy']])
y_train = df['play']

gNB = GaussianNB()
gNB.fit(X_train, y_train)

ndf=pd.DataFrame({'outlook':['sunny'], 'temp':['hot'], 'humidity':['normal'], 'windy':[False]})
X_test=pd.get_dummies(ndf[['outlook', 'temp', 'humidity', 'windy']])

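# Align the test columns with the training columns: dummy columns missing
# from X_test are added and filled with 0, columns that are present keep
# their actual values (so windy=False stays 0), and the column order
# follows X_train.
X_test_new = X_test.reindex(columns=X_train.columns, fill_value=0)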


gNB.predict(X_test_new)
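
If you would rather keep `get_dummies`, another option is to declare the full set of categories up front, so that every call produces the same columns. This is a minimal sketch and not part of the original answer: the `categories` mapping and the `encode` helper are hypothetical names, and the category values are assumed to be the ones that occur in `tennis.csv`.

```python
import pandas as pd

# Assumed category values for each column (hypothetical mapping; adjust it
# to whatever actually occurs in the training data).
categories = {
    "outlook": ["overcast", "rainy", "sunny"],
    "temp": ["cool", "hot", "mild"],
    "humidity": ["high", "normal"],
}

def encode(frame):
    """One-hot encode `frame` using the fixed category lists above."""
    frame = frame.copy()
    for col, cats in categories.items():
        # A categorical dtype makes get_dummies emit one column per
        # declared category, even if that value is absent from the frame.
        frame[col] = frame[col].astype(pd.CategoricalDtype(cats))
    return pd.get_dummies(frame)

ndf = pd.DataFrame({"outlook": ["sunny"], "temp": ["hot"],
                    "humidity": ["normal"], "windy": [False]})
X_test = encode(ndf)
print(list(X_test.columns))
```

Applying the same `encode` helper to both the training and the test frame gives matching columns in the same order, so the fitted model can predict on a single row.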
