I have a factor field Gender in a data frame named dataset. As far as I know, factors are similar to enumerations in C; that is, each name is mapped to a number.
> dataset$Gender <- as.factor(dataset$Gender)
> str(dataset$Gender)
Factor w/ 2 levels "Female","Male": 2 2 1 1 2 2 1 1 2 1 ...
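To make the enumeration analogy concrete, here is a minimal sketch of the integer codes I am referring to. The vector gender is a made-up stand-in for dataset$Gender, not my real data.

```r
# Hypothetical stand-in for dataset$Gender (first five rows only)
gender <- factor(c("Male", "Male", "Female", "Female", "Male"))

# Levels are sorted alphabetically by default, so Female comes first
gender_levels <- levels(gender)

# The underlying integer codes: Female is 1, Male is 2
gender_codes <- as.integer(gender)

print(gender_levels)
print(gender_codes)
```

This is the Female : 1 and Male : 2 mapping I expected the distance calculation to use.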
Now, when I run K-Nearest Neighbours and pass this field as an independent variable, it throws an error.
But if I assign labels to this factor field, everything works:
> dataset$Gender <- factor(dataset$Gender , levels = c("Female","Male") , labels = c(0,1))
> str(dataset$Gender)
Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 1 2 1 ...
What change did the labels make? Did they give Male and Female numeric weights that made it possible to calculate the Euclidean distance? If so, why wasn't the Euclidean distance calculated from the mapping the factor already has (Female : 1 and Male : 2) when no labels were provided? Why didn't that mapping work in the Euclidean distance calculation?
Dataset
> head(dataset)
   User.ID Gender Age EstimatedSalary Purchased
1 15624510   Male  19           19000         0
2 15810944   Male  35           20000         0
3 15668575 Female  26           43000         0
4 15603246 Female  27           57000         0
5 15804002   Male  19           76000         0
6 15728773   Male  27           58000         0
Example With Error
dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")
library(caTools)
set.seed(1231)
sample_split <- sample.split(dataset$Gender , SplitRatio = 0.8)
training_dataset <- subset(dataset , sample_split == TRUE)
testing_dataset <- subset(dataset , sample_split == FALSE)
library(class)
model_classifier <- knn(train = training_dataset[,-5] , test = testing_dataset[,-5] , cl = training_dataset$Purchased , k = 21 )
library(caret)
confusionMatrix(table(model_classifier , testing_dataset$Purchased))
Error
Error in knn(train = training_dataset[, -5], test = testing_dataset[, :
  NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(train = training_dataset[, -5], test = testing_dataset[, :
  NAs introduced by coercion
2: In knn(train = training_dataset[, -5], test = testing_dataset[, :
  NAs introduced by coercion
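The "NAs introduced by coercion" warning can be reproduced without knn. The sketch below assumes that knn turns its train and test arguments into a matrix with as.matrix before computing distances; I have not confirmed this in the class source. The data frame toy is made up for illustration.

```r
# Hypothetical toy data: one factor column, one numeric column
toy <- data.frame(Gender = factor(c("Male", "Female", "Male")),
                  Age    = c(19, 26, 35))

# A data frame with a factor column becomes a character matrix
m <- as.matrix(toy)
matrix_mode <- mode(m)

# "Male" and "Female" cannot be parsed as numbers, so they turn into NA
gender_as_number <- suppressWarnings(as.numeric(m[, "Gender"]))

# With labels 0 and 1 the strings "0" and "1" convert cleanly
toy$Gender <- factor(toy$Gender, levels = c("Female", "Male"), labels = c(0, 1))
relabelled_as_number <- as.numeric(as.matrix(toy)[, "Gender"])

print(matrix_mode)
print(gender_as_number)
print(relabelled_as_number)
```

If that assumption holds, the integer codes of the factor are never consulted: the level names are converted as text, which would explain why only numeric-looking labels get through.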
After Assigning Labels
dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")
dataset$Gender <- factor(dataset$Gender , levels = c("Female","Male") , labels = c(0 , 1))
library(caTools)
set.seed(1231)
sample_split <- sample.split(dataset$Gender , SplitRatio = 0.8)
training_dataset <- subset(dataset , sample_split == TRUE)
testing_dataset <- subset(dataset , sample_split == FALSE)
library(class)
model_classifier <- knn(train = training_dataset[,-5] , test = testing_dataset[,-5] , cl = training_dataset$Purchased , k = 21 )
library(caret)
confusionMatrix(table(model_classifier , testing_dataset$Purchased))