I have a factor field Gender in a data frame named dataset. As far as I know, factors are similar to enumerations in C, that is, each name is mapped to a number.

> dataset$Gender <- as.factor(dataset$Gender)
> str(dataset$Gender)
 Factor w/ 2 levels "Female","Male": 2 2 1 1 2 2 1 1 2 1 ...

Now, when performing K-Nearest Neighbour, if I pass this field as an independent variable, it throws an error.

However, if I assign labels to this factor field, everything works:

> dataset$Gender <- factor(dataset$Gender , levels = c("Female","Male") , labels = c(0,1))
> str(dataset$Gender)
 Factor w/ 2 levels "0","1": 2 2 1 1 2 2 1 1 2 1 ...

What did the labels change? Did they give Male and Female numeric values that made it possible to calculate the Euclidean distance? If so, why wasn't the Euclidean distance calculated from the factor's own mapping (Female: 1, Male: 2) when no labels were provided? Why didn't that mapping work in the distance calculation?
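For reference, a minimal sketch of the two things a factor stores: integer codes and character labels. The vector `gender` and the name `relabelled` are made up for illustration; they are not part of the dataset above.

```r
# Toy factor (illustrative values, not taken from the real dataset)
gender <- factor(c("Male", "Male", "Female"))

levels(gender)        # "Female" "Male"  (sorted alphabetically)
as.integer(gender)    # 2 2 1            (internal codes)
as.character(gender)  # "Male" "Male" "Female" (labels)

# Relabelling changes the labels, not the internal codes
relabelled <- factor(gender, levels = c("Female", "Male"), labels = c(0, 1))

as.integer(relabelled)                 # 2 2 1  (codes are unchanged)
as.character(relabelled)               # "1" "1" "0"
as.numeric(as.character(relabelled))   # 1 1 0
```

A function only sees the codes 1 and 2 if it asks for them explicitly, for example through `as.integer()`. Anything that goes through the labels sees "Female"/"Male" or "0"/"1" instead.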

Dataset

> head(dataset)
   User.ID Gender Age EstimatedSalary Purchased
1 15624510   Male  19           19000         0
2 15810944   Male  35           20000         0
3 15668575 Female  26           43000         0
4 15603246 Female  27           57000         0
5 15804002   Male  19           76000         0
6 15728773   Male  27           58000         0

Example With Error

dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")

library(caTools)
set.seed(1231)
sample_split <- sample.split(dataset$Gender, SplitRatio = 0.8)
training_dataset <- subset(dataset, sample_split == TRUE)
testing_dataset <- subset(dataset, sample_split == FALSE)

library(class)
model_classifier <- knn(train = training_dataset[, -5], test = testing_dataset[, -5], cl = training_dataset$Purchased, k = 21)

library(caret)
confusionMatrix(table(model_classifier, testing_dataset$Purchased))

Error

Error in knn(train = training_dataset[, -5], test = testing_dataset[,  : 
  NA/NaN/Inf in foreign function call (arg 6)
In addition: Warning messages:
1: In knn(train = training_dataset[, -5], test = testing_dataset[,  :
  NAs introduced by coercion
2: In knn(train = training_dataset[, -5], test = testing_dataset[,  :
  NAs introduced by coercion

After Assigning Labels

dataset <- read.csv("~/Desktop/Machine Learning /ML_16/Social_Network_Ads.csv")

dataset$Gender <- factor(dataset$Gender, levels = c("Female", "Male"), labels = c(0, 1))

library(caTools)
set.seed(1231)
sample_split <- sample.split(dataset$Gender, SplitRatio = 0.8)
training_dataset <- subset(dataset, sample_split == TRUE)
testing_dataset <- subset(dataset, sample_split == FALSE)

library(class)
model_classifier <- knn(train = training_dataset[, -5], test = testing_dataset[, -5], cl = training_dataset$Purchased, k = 21)

library(caret)
confusionMatrix(table(model_classifier, testing_dataset$Purchased))
  • Give us a small reproducible example and tell us what function you used (and what package it is in) to perform K-Nearest Neighbour - there are several implementations. – dcarlson Mar 07 '18 at 16:13
  • I have provided my code. Please have a look at it – Harshit Singhal Mar 07 '18 at 16:27
  • this is not a **reproducible** example (see [mcve] or https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example ). It is at first glance surprising that assigning different labels should make a difference, but maybe they get coerced to numeric? – Ben Bolker Mar 07 '18 at 16:30
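
The coercion guess in the last comment can be checked on a toy data frame. This is a sketch under one assumption: that `knn()` in the class package converts its `train` and `test` arguments with `as.matrix()` and then to double before computing distances. That would match the "NAs introduced by coercion" warning above, but it is worth confirming against the package source. The data frame `df` and its values are made up for illustration.

```r
# Toy stand-in for the Gender and Age columns (illustrative values)
df <- data.frame(
  Gender = factor(c("Male", "Female", "Male")),
  Age    = c(19, 26, 35)
)

# One non-numeric column makes the whole matrix character,
# and the factor contributes its labels, not its integer codes
m <- as.matrix(df)
is.character(m)          # TRUE

# "Male"/"Female" cannot be parsed as numbers
coerced <- as.double(m)  # warns: NAs introduced by coercion
is.na(coerced[1:3])      # TRUE TRUE TRUE

# With labels "0"/"1" the same conversion succeeds
df$Gender <- factor(df$Gender, levels = c("Female", "Male"), labels = c(0, 1))
relabelled_values <- as.double(as.matrix(df))
relabelled_values        # 1 0 1 19 26 35
```

If that assumption holds, the labels matter because the conversion goes through the labels as text, so the internal 1/2 codes are never used. Converting the column explicitly, for example with `as.integer(dataset$Gender)`, before calling `knn()` would be one way to use those codes directly.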

0 Answers