I am a beginner at this.
I have a classification problem, and my data looks like below:
The Result column is the dependent variable. None of the data is ordinal. (The Name column has 36 different names.)
Since it is categorical data, I tried OneHotEncoding
and got: ValueError: Number of features of the model must match the input
I understood this and referred to this SO Question, which fixed it.
There was also another site (Medium) that solves this ValueError
by using the Pandas factorize
function.
My questions are:
- What is the correct way to approach this? Should I factorize and then apply OneHotEncoding?
- Or, since my data is not ordinal, should I not use factorize at all?
- I always get 100% accuracy. Is it because of the encoding I do?
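From what I understand, factorize just assigns integer codes by order of first appearance, which imposes an arbitrary ordering on non-ordinal categories. A tiny example (toy values, not my real data):

```python
import pandas as pd

# factorize assigns integer codes by order of first appearance,
# so non-ordinal categories get an arbitrary numeric ordering
codes, uniques = pd.factorize(pd.Series(["b", "a", "b", "c"]))
print(codes)          # [0 1 0 2]
print(list(uniques))  # ['b', 'a', 'c']
```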
My code is below:
Training
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
dataset = pd.read_csv("model_data.csv")
dataset['Col1'] = pd.factorize(dataset['Col1'])[0]
dataset['Col2'] = pd.factorize(dataset['Col2'])[0]
dataset['name'] = pd.factorize(dataset['name'])[0]
dataset['ID'] = pd.factorize(dataset['ID'])[0]
X = dataset.iloc[:, 0:-1].values
y = dataset.iloc[:, -1].values
# Encoding
# Encoding categorical data
# Encoding the Independent Variable
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
ct = make_column_transformer((OneHotEncoder(sparse=False), [0, 1, 2, 3]), remainder='passthrough')
X = ct.fit_transform(X)
# Encoding the Dependent Variable
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 5, criterion = 'entropy', max_depth = 5, random_state = 0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
Testing
test_data_set = pd.read_csv("test_data.csv")
test_data_set['Col1'] = pd.factorize(test_data_set['Col1'])[0]
test_data_set['Col2'] = pd.factorize(test_data_set['Col2'])[0]
test_data_set['name'] = pd.factorize(test_data_set['name'])[0]
test_data_set['ID'] = pd.factorize(test_data_set['ID'])[0]
X_test_data = test_data_set.iloc[:, 0:-1].values
y_test_data = test_data_set.iloc[:, -1].values
y_test_data = le.transform(y_test_data)
classifier.fit(X_test_data, y_test_data)  # refitting here fixes the ValueError
y_test_pred = classifier.predict(X_test_data)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test_data, y_test_pred)
print(cm)
print(accuracy_score(y_test_data, y_test_pred))
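For comparison, here is a minimal self-contained sketch (with toy stand-in data, not my real columns) of what I think the testing step should look like: reusing the fitted transformer and classifier on the test set instead of refitting them.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Toy stand-in for my data (hypothetical values and columns)
train = pd.DataFrame({"Col1": ["a", "b", "a", "b"],
                      "name": ["x", "y", "x", "z"],
                      "Result": [0, 1, 0, 1]})
test = pd.DataFrame({"Col1": ["b", "a"],
                     "name": ["y", "x"],
                     "Result": [1, 0]})

# handle_unknown="ignore" zeroes out categories unseen during training
ct = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Col1", "name"]),
    remainder="passthrough")

X_train = ct.fit_transform(train[["Col1", "name"]])
clf = RandomForestClassifier(random_state=0).fit(X_train, train["Result"])

# transform (not fit_transform) the test set, and do NOT refit the classifier
X_test = ct.transform(test[["Col1", "name"]])
print(accuracy_score(test["Result"], clf.predict(X_test)))
```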
EDIT:
- The number of rows in my dataset is 2000.
- The resulting accuracy_score is 1.0.
Confusion Matrix
[[113 0]
[ 0 30]]
I am not sure about this: I have around 2000 rows, but my TP and TN together only add up to 143 counts.