Why KNN shows different prediction when factor labels are changed in R

Question

What changes when we assign labels to a factor field? In the following code I first assigned the labels 0 and 1, and then the labels 0 and 10^6. As far as I know, labels just provide alternate names for the categories, which in this case are Male and Female. Please note that I have provided numeric labels, not character labels.

It seems that the labels assign some sort of numeric weight to the categories, which changes the Euclidean distance between data points. Below are the two versions of the code with the corresponding results.
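To make the suspected effect concrete, here is a minimal sketch of the Euclidean distance between two rows, assuming the label is used as a plain number. It takes rows 2 and 3 of the `head(dataset)` output shown under Dataset; `euclid` is a hypothetical helper, and `User.ID` is left out only to keep the arithmetic readable (the actual code passes it to `knn()` as well).

```r
# Rows 2 and 3 of head(dataset): one Male, one Female.
# User.ID is omitted here purely for readability.
male   <- c(Age = 35, EstimatedSalary = 20000)
female <- c(Age = 26, EstimatedSalary = 43000)

# Hypothetical helper: plain Euclidean distance between two numeric vectors.
euclid <- function(x, y) sqrt(sum((x - y)^2))

# Gender treated as the number 0 or 1
d_small <- euclid(c(female, Gender = 0), c(male, Gender = 1))

# Gender treated as the number 0 or 10^6
d_large <- euclid(c(female, Gender = 0), c(male, Gender = 10^6))

print(c(labels_0_1 = d_small, labels_0_1e6 = d_large))
```

If the labels really are treated as numbers, the 0/1 coding barely moves the distance (the salary gap dominates), while the 0/10^6 coding makes the gender difference dominate, so the set of nearest neighbours could change.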

Dataset

> head(dataset)
   User.ID Gender Age EstimatedSalary Purchased
1 15624510      1  19           19000         0
2 15810944      1  35           20000         0
3 15668575      0  26           43000         0
4 15603246      0  27           57000         0
5 15804002      1  19           76000         0
6 15728773      1  27           58000         0

R code with labels = c(0 , 1)

dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")

# Recode Gender as a factor with the labels 0 and 1
dataset$Gender <- factor(dataset$Gender, levels = c("Female", "Male"), labels = c(0, 1))

library(caTools)
set.seed(1231)

# 80/20 train/test split
sample_split <- sample.split(dataset$Gender, SplitRatio = 0.8)
training_dataset <- subset(dataset, sample_split == TRUE)
testing_dataset <- subset(dataset, sample_split == FALSE)

library(class)

# Column 5 is Purchased (the class); every other column is passed as a feature
model_classifier <- knn(train = training_dataset[, -5], test = testing_dataset[, -5], cl = training_dataset$Purchased, k = 21)

library(caret)
confusionMatrix(table(model_classifier, testing_dataset$Purchased))

Result

Confusion Matrix and Statistics


model_classifier  0  1
               0 47 18
               1  4 11

Accuracy : 0.725

R code with labels = c(0 , 10^6)

dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")

# Recode Gender as a factor with the labels 0 and 10^6
dataset$Gender <- factor(dataset$Gender, levels = c("Female", "Male"), labels = c(0, 10^6))

library(caTools)
set.seed(1231)

# 80/20 train/test split
sample_split <- sample.split(dataset$Gender, SplitRatio = 0.8)
training_dataset <- subset(dataset, sample_split == TRUE)
testing_dataset <- subset(dataset, sample_split == FALSE)

library(class)

# Column 5 is Purchased (the class); every other column is passed as a feature
model_classifier <- knn(train = training_dataset[, -5], test = testing_dataset[, -5], cl = training_dataset$Purchased, k = 21)

library(caret)
confusionMatrix(table(model_classifier, testing_dataset$Purchased))

Result

Confusion Matrix and Statistics


model_classifier  0  1
               0 50 23
               1  1  6

Accuracy : 0.7

What exactly is the label doing? If we provide numeric labels, do they have the same mathematical significance as the numbers themselves?

When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Don't point to CSV files that we don't have. See the link for ways to include data in your question itself via `dput()`. — MrFlick, Mar 07 '18 at 17:40
Why would `User.ID` be a relevant variable in prediction? If you want to use a dichotomy, just code it as 0 and 1. The knn() function converts the data frame to a numeric matrix so factor levels/labels become numeric values if they can be converted or NAs otherwise. — dcarlson, Mar 07 '18 at 18:18
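
The conversion described in the comment above can be sketched in base R. This is an illustration under assumptions, not the internals of `knn()`: it assumes the data frame is coerced with something like `as.matrix()` and then to numbers, and the small data frame `df` is made up for the example.

```r
# A factor built the same way as in the question, on made-up values.
g <- factor(c("Male", "Female", "Male"),
            levels = c("Female", "Male"),
            labels = c(0, 10^6))

levels(g)                    # labels are stored as character strings
as.integer(g)                # internal integer codes, independent of the labels
as.numeric(as.character(g))  # the labels read back as numbers

# Hypothetical data frame mixing the factor with a numeric column.
df <- data.frame(Gender = g, Age = c(19, 26, 35))

# Coercing to a matrix yields a character matrix, because of the factor column.
m <- as.matrix(df)

# Converting that column to numbers recovers the label values, not the codes.
gender_numeric <- as.numeric(m[, "Gender"])
print(gender_numeric)
```

Under that assumption, a label of 10^6 enters the distance calculation as the number one million, whereas a non-numeric label such as "Male" would convert to `NA`.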
