What changes when we assign labels to a factor? In the following code I first assigned the labels 0 and 1, and then the labels 0 and 10^6. As far as I know, labels just provide alternate names for the categories, which in this case are Male and Female. Please note that I have provided numeric labels, not character labels.
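To make this concrete, here is a minimal sketch of how the two sets of labels could be inspected. The `gender` vector is made up and only stands in for the real Gender column; the other variable names are also just for illustration.

```r
# Hypothetical toy vector standing in for dataset$Gender
gender <- c("Male", "Male", "Female", "Female", "Male")

f_small <- factor(gender, levels = c("Female", "Male"), labels = c(0, 1))
f_large <- factor(gender, levels = c("Female", "Male"), labels = c(0, 10^6))

# Numeric labels are stored as character level names
label_is_text <- is.character(levels(f_small)) && is.character(levels(f_large))

# The underlying integer codes do not depend on the labels
codes_small <- as.integer(f_small)
codes_large <- as.integer(f_large)

# Converting through the label text recovers the numbers given as labels
values_small <- as.numeric(as.character(f_small))
values_large <- as.numeric(as.character(f_large))

print(codes_small)
print(codes_large)
print(values_small)
print(values_large)
```

In this sketch the integer codes are the same for both factors and only the label text differs, so any numeric effect would have to come from code that converts the labels themselves back into numbers.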
It seems that the labels are giving some sort of numeric weight to the categories, which changes the Euclidean distance between data points. Below are the two versions of the code with the corresponding results.
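The effect I suspect can be sketched as follows, under the assumption that the label values end up being used as plain numbers in the distance calculation. The two users and their Age and Salary values are made up for illustration.

```r
# Two made-up users: same Age and Salary, different Gender
age    <- c(30, 30)
salary <- c(50000, 50000)

# Gender encoded as 0/1 versus 0/10^6 (assumed numeric encodings of the labels)
users_small <- cbind(Gender = c(0, 1),    Age = age, Salary = salary)
users_large <- cbind(Gender = c(0, 10^6), Age = age, Salary = salary)

# Euclidean distance between the two rows under each encoding
d_small <- as.numeric(dist(users_small))
d_large <- as.numeric(dist(users_large))

print(d_small)  # 1
print(d_large)  # 1e+06
```

If this is what happens, a gender difference of 10^6 would outweigh the salary differences in the rows shown below, which are in the tens of thousands, so the nearest neighbours would almost always share the same gender.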
Dataset
> head(dataset)
   User.ID Gender Age EstimatedSalary Purchased
1 15624510      1  19           19000         0
2 15810944      1  35           20000         0
3 15668575      0  26           43000         0
4 15603246      0  27           57000         0
5 15804002      1  19           76000         0
6 15728773      1  27           58000         0
R code with labels = c(0 , 1)
dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")
# Recode Gender as a factor with labels 0 (Female) and 1 (Male)
dataset$Gender <- factor(dataset$Gender, levels = c("Female", "Male"), labels = c(0, 1))
library(caTools)
set.seed(1231)
# 80/20 train/test split using sample.split() on Gender
sample_split <- sample.split(dataset$Gender, SplitRatio = 0.8)
training_dataset <- subset(dataset, sample_split == TRUE)
testing_dataset <- subset(dataset, sample_split == FALSE)
library(class)
# k-NN with k = 21 on every column except Purchased (column 5)
model_classifier <- knn(train = training_dataset[, -5], test = testing_dataset[, -5], cl = training_dataset$Purchased, k = 21)
library(caret)
# Confusion matrix of predicted vs. actual Purchased
confusionMatrix(table(model_classifier, testing_dataset$Purchased))
Result
Confusion Matrix and Statistics
model_classifier  0  1
               0 47 18
               1  4 11
Accuracy : 0.725
R code with labels = c(0 , 10^6)
dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")
# Same as above, but the labels are now 0 (Female) and 10^6 (Male)
dataset$Gender <- factor(dataset$Gender, levels = c("Female", "Male"), labels = c(0, 10^6))
library(caTools)
set.seed(1231)
# 80/20 train/test split using sample.split() on Gender
sample_split <- sample.split(dataset$Gender, SplitRatio = 0.8)
training_dataset <- subset(dataset, sample_split == TRUE)
testing_dataset <- subset(dataset, sample_split == FALSE)
library(class)
# k-NN with k = 21 on every column except Purchased (column 5)
model_classifier <- knn(train = training_dataset[, -5], test = testing_dataset[, -5], cl = training_dataset$Purchased, k = 21)
library(caret)
# Confusion matrix of predicted vs. actual Purchased
confusionMatrix(table(model_classifier, testing_dataset$Purchased))
Result
Confusion Matrix and Statistics
model_classifier  0  1
               0 50 23
               1  1  6
Accuracy : 0.7
What exactly is the label doing? If we provide numeric labels, do they carry the same mathematical significance as actual numbers?