
I am a beginner at this,

I have a classification problem and my data looks like below:

[image of sample rows omitted; the columns are Col1, Col2, name, ID and Result, as in the example dataframe in the first answer below]

The Result column is the dependent variable. None of the data is ordinal. (The name column has 36 different names.)

As it is categorical data, I tried OneHotEncoding and got ValueError: Number of features of the model must match the input.

I understood the error, referred to this SO question, and got it fixed.

There was also a Medium article that solved this ValueError by using pandas' factorize function.
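
(For reference, pd.factorize simply maps each distinct value to an integer code, assigned by order of first appearance, with no ordinal meaning. A quick illustration:)

import pandas as pd

# factorize returns (codes, uniques); codes follow first-appearance order
codes, uniques = pd.factorize(['AB', 'A', 'AB', 'B'])
print(codes)    # [0 1 0 2]
print(uniques)  # ['AB' 'A' 'B']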

My questions are:

  1. What is the correct way to approach this? Should I factorize and then apply OneHotEncoding?
  2. Or, since my data is not ordinal, should I not use factorize at all?
  3. I always get 100% accuracy. Is it because of the encoding I do?

My code is below:

Training

# -*- coding: utf-8 -*-

import numpy as np

import pandas as pd
dataset = pd.read_csv("model_data.csv")


# integer-encode each categorical column; codes are assigned by order of appearance
dataset['Col1'] = pd.factorize(dataset['Col1'])[0]
dataset['Col2'] = pd.factorize(dataset['Col2'])[0]
dataset['name'] = pd.factorize(dataset['name'])[0]
dataset['ID'] = pd.factorize(dataset['ID'])[0]

X = dataset.iloc[:, 0:-1].values
y = dataset.iloc[:, -1].values

# One-hot encode the categorical independent variables
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

ct = make_column_transformer((OneHotEncoder(sparse=False), [0, 1, 2, 3]), remainder='passthrough')
X = ct.fit_transform(X)


# Encoding the Dependent Variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

# Train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)



from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 5, criterion = 'entropy', max_depth = 5, random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

Testing

test_data_set =  pd.read_csv("test_data.csv")


test_data_set['Col1'] = pd.factorize(test_data_set['Col1'])[0]
test_data_set['Col2'] = pd.factorize(test_data_set['Col2'])[0]
test_data_set['name'] = pd.factorize(test_data_set['name'])[0]
test_data_set['ID'] = pd.factorize(test_data_set['ID'])[0]

X_test_data = test_data_set.iloc[:, 0:-1].values
y_test_data = test_data_set.iloc[:, -1].values


y_test_data = le.transform(y_test_data)


classifier.fit(X_test_data, y_test_data)  # fixes ValueError
y_test_pred = classifier.predict(X_test_data)

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test_data, y_test_pred)
print(cm)
print(accuracy_score(y_test_data, y_test_pred))

EDIT:

  • My dataset has 2000 rows.
  • The resulting accuracy_score is 1.0.

Confusion matrix:

[[113   0]
 [  0  30]]

I am not sure why: I have around 2000 rows, but my TP and TN together add up to only 143.

  • How many rows do you have in your dataset? Also, don't use fit_transform on your test data. And also add the generated results to your question. – Aniket Bote Sep 02 '20 at 05:13
  • @AniketBote I have edited the question. I have removed the fit_transform – user3164187 Sep 02 '20 at 05:29
  • Also, add a confusion matrix. Also, make sure that test data and train data don't contain the same samples. The length of the test data should be at least 20% of the training set and should be representative of all labels. – Aniket Bote Sep 02 '20 at 05:30
  • Don't you fit your test data with `classifier.fit(X_test_data, y_test_data)` and overwrite your previous training data fit? I think you shouldn't do a `fit` with the test data, just a `predict` – Berger Sep 02 '20 at 05:38
  • @Berger Yeah, I have never done it before either; I just tried it to fix the **ValueError**. It may be wrong. – user3164187 Sep 02 '20 at 05:47
  • @Berger is right: if you call fit on the test data you will get an accuracy of 1.0, since you are retraining your model on that exact data. – Aniket Bote Sep 02 '20 at 06:35
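
To make the commenters' point concrete, here is a minimal sketch of a corrected testing section. It assumes the factorize calls are dropped on both sides (OneHotEncoder can encode the raw string columns directly) and that the test file contains no categories unseen in training (otherwise pass handle_unknown='ignore' to OneHotEncoder); ct, le, and classifier are the objects already fitted on the training data:

test_data_set = pd.read_csv("test_data.csv")

X_test_data = test_data_set.iloc[:, 0:-1].values
y_test_data = le.transform(test_data_set.iloc[:, -1].values)

# transform only: reuse the encoder fitted on the training data,
# so the test features get exactly the same layout as in training
X_test_data = ct.transform(X_test_data)

# predict only; never call fit on the test set
y_test_pred = classifier.predict(X_test_data)

from sklearn.metrics import confusion_matrix, accuracy_score
print(confusion_matrix(y_test_data, y_test_pred))
print(accuracy_score(y_test_data, y_test_pred))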

2 Answers


Here is an example of how you can use one-hot encoding on your data to perform binary classification.

First, one-hot encode all of your feature columns, then factorize the Y/N classes in the "Result" column into 1/0:

dataset = pd.read_csv("model_data.csv")

# one-hot encode every feature column
dataset = pd.get_dummies(dataset, columns=['Col1', 'Col2', 'name', 'ID'])
# map the Y/N target to integer codes
dataset.Result = pd.factorize(dataset.Result)[0]

You should get a result like the one shown below in your resulting dataframe, which you can use for your training/testing steps.

Initial dataframe:

  Col1 Col2     name    ID Result
0   AB    A     John -2500      N
1   AB    A     John -2500      N
2    A    A     John -2500      N
3    A    A    Jacob -2500      Y
4    A    A  Micheal -2500      Y
5    A   AB     John -2500      N
6    A    A  Sheldon -2500      Y
7   AB   AB  Sheldon -2500      N
8   AB   AB    Jacob -2500      Y

Resulting dataframe:


   Result  Col1_A  Col1_AB  Col2_A  Col2_AB  name_Jacob  name_John  name_Micheal  name_Sheldon  ID_-2500
0       0       0        1       1        0           0          1             0             0         1
1       0       0        1       1        0           0          1             0             0         1
2       0       1        0       1        0           0          1             0             0         1
3       1       1        0       1        0           1          0             0             0         1
4       1       1        0       1        0           0          0             1             0         1
5       0       1        0       0        1           0          1             0             0         1
6       1       1        0       1        0           0          0             0             1         1
7       0       0        1       0        1           0          0             0             1         1
8       1       0        1       0        1           1          0             0             0         1
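
From here, a minimal sketch of wiring such a frame into the training step (column names as in the example above):

# everything except Result is now a numeric feature column
X = dataset.drop(columns=['Result']).values
y = dataset['Result'].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=5, max_depth=5, random_state=0)
classifier.fit(X_train, y_train)
print(classifier.score(X_test, y_test))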

Hope it helps.

  • Thanks! I have trained my model with this, but while testing I get **ValueError: Number of features of the model must match the input**, as my test data has fewer n_features. How can I overcome this? – user3164187 Sep 02 '20 at 07:02
  • @user3164187 This question was probably already answered in this thread: https://stackoverflow.com/questions/44026832/valueerror-number-of-features-of-the-model-must-match-the-input – Oleksii Komarov Sep 02 '20 at 07:30
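
The usual fix from that thread is to force the test dummies into the same column layout as the training dummies. A minimal sketch, where train_cols is assumed to hold the list of feature columns saved from the one-hot-encoded training frame (train_cols is not in the original code, only an illustration):

# one-hot encode the test set the same way as the training set
test_dummies = pd.get_dummies(test_data_set, columns=['Col1', 'Col2', 'name', 'ID'])

# align with the training layout: dummy columns missing from the test
# set are added as all-zero columns, and extra unseen ones are dropped
test_dummies = test_dummies.reindex(columns=train_cols, fill_value=0)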

You can use the pd.get_dummies() method; it's usually pretty reliable. This guide should get you started. Cheers!

  • My data is nominal; will the model consider it ordinal if I use get_dummies()? – user3164187 Sep 02 '20 at 06:16
  • @user3164187 `get_dummies()` is typically meant for nominal data that you'd like to one-hot encode; if you've got ordinal data, you'd rather stay away from it, or you could use quantile-based discretization using `pd.qcut()` and make new columns specifying whether your data falls within a cut-off range or not. For example, if a feature ranges between 0 and 100, pd.qcut() could make 4 new columns - '0-25', '26-50', '51-75' and '76-100' - with 0s and 1s indicating whether that example falls within that range. – Arjun Sohanlal Sep 02 '20 at 06:34
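
(A quick sketch of the pd.qcut idea from the last comment, using a hypothetical numeric 'score' column:)

import pandas as pd

df = pd.DataFrame({'score': [3, 27, 55, 61, 78, 94]})

# cut the feature into 4 quantile-based bins...
df['score_bin'] = pd.qcut(df['score'], q=4)

# ...then one-hot encode the bin labels into 0/1 indicator columns
df = pd.get_dummies(df, columns=['score_bin'])
print(df)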