First post here, so I'll try to keep it quick. Today I presented my final project on Gaussian processes and random forests, in which I fit both in R on the same data set to compare and contrast them. My professor said my Gaussian process was wrong because my variables are categorical. But when I checked, they are all stored as integer or numeric: even though some of them are things like "education level" and "marital status", they're just 1's and 0's. I've posted my code below, and I would be forever indebted if someone could explain why this is or isn't okay!
> library(kernlab)
> load("/Users/benjaminfoster/Desktop/bank_data.Rdata")
>
> str(data_train)
'data.frame': 2260 obs. of 11 variables:
$ age : int 59 39 41 43 31 40 37 25 31 42 ...
$ marital : num 1 1 1 1 1 1 0 0 1 0 ...
$ education: num 0 0 1 0 0 1 1 0 0 1 ...
$ default : num 0 0 0 0 0 0 0 0 0 0 ...
$ balance : int 0 147 221 264 360 194 2317 -221 132 16 ...
$ housing : num 1 1 1 1 1 0 1 1 0 0 ...
$ loan : num 0 0 0 0 1 1 0 0 0 0 ...
$ duration : int 226 151 57 113 89 189 114 250 148 140 ...
$ campaign : int 1 2 2 2 1 2 1 1 1 3 ...
$ previous : int 0 0 0 0 1 0 2 0 1 0 ...
$ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
> str(data_test)
'data.frame': 2261 obs. of 11 variables:
$ age : int 30 33 35 30 35 36 43 39 36 20 ...
$ marital : num 1 1 0 1 0 1 1 1 1 0 ...
$ education: num 0 0 1 1 1 1 0 0 1 0 ...
$ default : num 0 0 0 0 0 0 0 0 0 0 ...
$ balance : int 1787 4789 1350 1476 747 307 -88 9374 1109 502 ...
$ housing : num 0 1 1 1 0 1 1 1 0 0 ...
$ loan : num 0 1 0 1 0 0 1 0 0 0 ...
$ duration : int 79 220 185 199 141 341 313 273 328 261 ...
$ campaign : int 1 1 1 4 2 1 1 1 2 1 ...
$ previous : int 0 4 1 0 3 2 2 0 0 0 ...
$ y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
>
> #response variable is 'y'
> gp_model <- gausspr(y~., data=data_train)
Using automatic sigma estimation (sigest) for RBF or laplace kernel
>
> pred_y <- predict(gp_model, newdata=data_test, type = "probabilities")
>
> #compute prediction accuracy with 0.5 threshold
> mean((pred_y[,2]>0.5)==(data_test$y==1))
[1] 0.8894295
>
> #create ROC curve and compute area under the curve
> library(ROCit)
> ROCit_obj <- rocit(score=pred_y[,2], class=data_test$y)
> plot(ROCit_obj)
> ROCit_obj$AUC
[1] 0.8470083
>
> #Confusion Matrix
> table((pred_y[,2]>0.5),(data_test$y=='1'))[c(2,1),c(2,1)]
        TRUE FALSE
  TRUE    51    52
  FALSE  198  1960
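
One way to probe the numeric-versus-categorical question is to compare the design matrix R builds from a 0/1 numeric column with the one it builds from the same column stored as a two-level factor. The sketch below uses base R only and a small made-up data frame (`toy` is hypothetical, not the bank data). It assumes default treatment contrasts, and it does not by itself show what `gausspr` does internally with factors.

```r
# Toy data, made up for illustration (NOT the bank data set above)
toy <- data.frame(
  age     = c(59, 39, 41, 43, 31, 40),
  marital = c(1, 1, 0, 1, 0, 0)
)

# Design matrix with marital kept as a 0/1 numeric column
mm_numeric <- model.matrix(~ age + marital, data = toy)

# Design matrix with marital converted to a two-level factor
toy_factor <- transform(toy, marital = factor(marital, levels = c(0, 1)))
mm_factor <- model.matrix(~ age + marital, data = toy_factor)

# Only the column name changes ("marital" vs "marital1");
# the numbers in the two matrices are compared element by element
same_values <- all(unname(mm_numeric) == unname(mm_factor))

print(colnames(mm_numeric))
print(colnames(mm_factor))
print(same_values)
```

If the values match, a binary variable coded 0/1 enters a formula-based model the same way under either storage type. That would not hold for a variable with three or more categories coded 0, 1, 2, ..., where the numeric coding imposes an order and spacing that a factor does not.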