
First post here, so I'll try to keep it quick. Today I presented my final project on Gaussian processes and random forests, in which I ran both in R on the same data set to compare and contrast them. My professor said that my Gaussian process was wrong because my variables were categorical, but when I looked they appear to be just integer and numeric. Even though some of the variables are things like "education level" and "marital status", they are coded as 1's and 0's. I'll post my code below, and I would be forever indebted if someone could explain to me why this is or isn't okay!

> library(kernlab)
> load("/Users/benjaminfoster/Desktop/bank_data.Rdata")
> 
> str(data_train)
'data.frame':   2260 obs. of  11 variables:
 $ age      : int  59 39 41 43 31 40 37 25 31 42 ...
 $ marital  : num  1 1 1 1 1 1 0 0 1 0 ...
 $ education: num  0 0 1 0 0 1 1 0 0 1 ...
 $ default  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ balance  : int  0 147 221 264 360 194 2317 -221 132 16 ...
 $ housing  : num  1 1 1 1 1 0 1 1 0 0 ...
 $ loan     : num  0 0 0 0 1 1 0 0 0 0 ...
 $ duration : int  226 151 57 113 89 189 114 250 148 140 ...
 $ campaign : int  1 2 2 2 1 2 1 1 1 3 ...
 $ previous : int  0 0 0 0 1 0 2 0 1 0 ...
 $ y        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
> str(data_test)
'data.frame':   2261 obs. of  11 variables:
 $ age      : int  30 33 35 30 35 36 43 39 36 20 ...
 $ marital  : num  1 1 0 1 0 1 1 1 1 0 ...
 $ education: num  0 0 1 1 1 1 0 0 1 0 ...
 $ default  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ balance  : int  1787 4789 1350 1476 747 307 -88 9374 1109 502 ...
 $ housing  : num  0 1 1 1 0 1 1 1 0 0 ...
 $ loan     : num  0 1 0 1 0 0 1 0 0 0 ...
 $ duration : int  79 220 185 199 141 341 313 273 328 261 ...
 $ campaign : int  1 1 1 4 2 1 1 1 2 1 ...
 $ previous : int  0 4 1 0 3 2 2 0 0 0 ...
 $ y        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
> 
> #response variable is 'y'
> gp_model <- gausspr(y~., data=data_train)
Using automatic sigma estimation (sigest) for RBF or laplace kernel 
> 
> pred_y <- predict(gp_model, newdata=data_test, type = "probabilities")
> 
> #compute prediction accuracy with 0.5 threshold
> mean((pred_y[,2]>0.5)==(data_test$y==1))
[1] 0.8894295
> 
> #creating ROC curve and compute area under the curve
> library(ROCit)
> ROCit_obj <- rocit(score=pred_y[,2], class=data_test$y)
> plot(ROCit_obj)
> ROCit_obj$AUC
[1] 0.8470083
> 
> #Confusion Matrix
> table((pred_y[,2]>0.5),(data_test$y=='1'))[c(2,1),c(2,1)]
       
        TRUE FALSE
  TRUE    51    52
  FALSE  198  1960

  • Hi @skipfoster, welcome to SO! It looks like your `y` variable is a factor? So it sounds like your prof wants you to have that as `int` type instead? – Russ Thomas Dec 01 '20 at 00:26
  • Thanks a lot! First-time poster, but I have been browsing the forums for the past two years! Yes, my y variable is a factor. It indicates whether or not members of a bank subscribe to a telemarketing offer to open a deposit account, 0=decline offer and 1=accept offer. The other variables are the members' information, and the purpose of this project is to predict who will accept/decline based on their input variables – skipfoster Dec 01 '20 at 00:30
  • Hi @skipfoster, so based on what you say your professor has said, I think the prof wants you to convert `y` to a non-factor, so you could do `y <- as.numeric(levels(y))[y]` immediately after reading in your Rdata file and before running your model. You can read more about this command at [How to convert a factor to integer\numeric without loss of information?](https://stackoverflow.com/questions/3418128/how-to-convert-a-factor-to-integer-numeric-without-loss-of-information). Perhaps run this by your professor to confirm that is what was being suggested? – Russ Thomas Dec 01 '20 at 01:37

1 Answer


A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values. So your professor is right: despite being represented by numbers, these variables are categorical, because the numbers do not measure quantities but label a state, such as a marital status or an education level.
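To make the distinction concrete, here is a minimal base-R sketch of how 0/1 columns could be declared as factors. The `toy` data frame and its values are made up for illustration and only mimic the structure of `data_train`. The sketch also checks a general property of dummy coding: with R's default treatment contrasts, a two-level factor expands to a single 0/1 column, so for binary variables the design matrix holds the same numbers either way. How `gausspr` treats factors internally is not shown here, so take this as an illustration of the encoding, not of the model fit.

```r
# Toy data mimicking the structure of data_train (values are hypothetical)
toy <- data.frame(
  age     = c(59L, 39L, 41L, 43L, 31L, 40L),
  marital = c(1, 1, 1, 0, 0, 1),
  housing = c(1, 0, 1, 1, 0, 0)
)

# Declare the 0/1 indicator columns as factors so R treats them as categorical
cat_cols <- c("marital", "housing")
toy_f <- toy
toy_f[cat_cols] <- lapply(toy_f[cat_cols], factor, levels = c(0, 1))
str(toy_f)

# Build the design matrices from the numeric and the factor versions
X_num <- model.matrix(~ age + marital + housing, data = toy)
X_fac <- model.matrix(~ age + marital + housing, data = toy_f)

# Only the column names differ ("marital" vs "marital1"); the values match
colnames(X_fac)
all(X_num == X_fac)
```

If you want to rerun the model with the variables declared as categorical, the same `lapply(..., factor)` conversion could be applied to `marital`, `education`, `default`, `housing` and `loan` in both `data_train` and `data_test` before calling `gausspr(y ~ ., data = data_train)`. For a variable with more than two categories stored as integers, the factor and numeric encodings would no longer coincide, and that is where the distinction starts to matter.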

manuzambo
  • Thanks for the quick reply! I'm a little confused because I got VERY similar results with my random forest code. Is there anything I can change about the variables so that I can run the Gaussian process again? – skipfoster Dec 01 '20 at 00:24