
I am new to the R programming language and I need to run "xgboost" for some experiments. The problem is that I need to cross-validate the model and get its accuracy, and I found two approaches that give me different results:

With "caret" using:

library(mlbench)
library(caret)
library(caretEnsemble)

# Read the UCI student performance data (semicolon-separated)
dtrain <- read.csv("student-mat.csv", header = TRUE, sep = ";")
formula <- G3 ~ .
dtrain$G3 <- as.factor(dtrain$G3)   # treat the final grade as a class label

control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"

set.seed(10)
fit.xgb <- train(formula, data = dtrain, method = "xgbTree", metric = metric, trControl = control, nthread = 4)
fit.xgb

set.seed(10)
fit.xgbl <- train(formula, data = dtrain, method = "xgbLinear", metric = metric, trControl = control, nthread = 4)
fit.xgbl
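
To read the cross-validated accuracy off the fitted caret object (a minimal sketch; fit.xgb$results and confusionMatrix() on a train object are standard caret accessors, shown here only to make the two snippets comparable):

fit.xgb$results                               # accuracy per tuning-parameter combination
max(fit.xgb$results$Accuracy, na.rm = TRUE)   # best cross-validated accuracy
confusionMatrix(fit.xgb)                      # average resampled confusion matrix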

Using the "xgboost" package and the following code:

library(xgboost)

# Helper that prints a labelled, comma-separated vector
printArray <- function(label, array) {
  cat(paste(label, paste(array, collapse = ", "), sep = ": \n"), "\n\n")
}

setwd("D:\\datasets")
dtrain <- read.csv("moodle7original(AtributosyNotaNumericos).csv", header = TRUE, sep = ",")
label <- as.numeric(dtrain[[33]])              # column 33 is the class label
data <- as.matrix(sapply(dtrain, as.numeric))  # converts every column, label included

croosvalid <- xgb.cv(
  data = data,
  nfold = 10,
  nround = 10,
  label = label,
  prediction = TRUE,
  objective = "multi:softmax",
  num_class = 33
)

print(croosvalid)
printArray("Actual classes", label[label != croosvalid$pred])
printArray("Predicted classes", croosvalid$pred[label != croosvalid$pred])
correctlyClassified <- length(label[label == croosvalid$pred])
incorrectlyClassified <- length(label[label != croosvalid$pred])
accuracy <- correctlyClassified * 100 / (correctlyClassified + incorrectlyClassified)
print(paste("Accuracy: ", accuracy))

But the results differ greatly on the same dataset: I usually get ~99% accuracy on the student performance dataset with the second snippet and only ~63% with the first...
I set the same seed for both of them.
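
As far as I understand, though, the same seed does not give the same folds across the two packages, since each draws its partition independently. A sketch of how identical folds could be forced (assuming dtrain from the first snippet and data/label from the second; caret::createFolds and the folds argument of xgb.cv are the relevant hooks):

set.seed(10)
folds <- createFolds(dtrain$G3, k = 10)   # list of test-row indices, built once

# caret: hand over the complementary training indices per resample
control <- trainControl(method = "cv", number = 10,
                        index = lapply(folds, function(i) setdiff(seq_len(nrow(dtrain)), i)),
                        indexOut = folds)

# xgboost: xgb.cv accepts the same test-index list directly
croosvalid <- xgb.cv(data = data, label = label, nrounds = 10,
                     folds = folds, prediction = TRUE,
                     objective = "multi:softmax", num_class = 33)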

Is the second approach wrong? If so, please tell me why!

  • Without a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) it will be hard to figure out exactly why, but it likely comes down to your having different settings between the two, either through what you're explicitly providing or what the defaults are, since `caret` is just a wrapper of `xgboost` (i.e., it doesn't implement its own `xgboost` version, it just calls the same `xgboost` package) – Tchotchke Jul 18 '16 at 13:12
  • Here is the dataset used: https://archive.ics.uci.edu/ml/machine-learning-databases/00320/ . Just let me know what you need to reproduce the example. I tried with parameters as basic as I could, but there is still a ~25-30% difference – Stefan Paul Popescu Jul 18 '16 at 15:41

1 Answer


Two things differ between the two code snippets; the first is the more serious:

  • When you call label <- as.numeric(dtrain[[11]]) and data <- as.matrix(sapply(dtrain, as.numeric)), the 11th column of data is the label itself. Of course you'll get a high accuracy: the label is part of the training data! That's serious leakage; you should instead use data <- as.matrix(sapply(dtrain[,-11L], as.numeric))

  • A minor difference is that you use objective = "multi:softmax" in the second snippet, whereas caret implements objective = "multi:softprob" for multiclass classification. I don't know how much of a difference that makes, but it does differ between the two snippets; check it (see the sketch below).
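
A minimal sketch combining both fixes (it reuses the question's data/label construction with the column index from the first point; whether croosvalid$pred comes back as an n x num_class probability matrix depends on your xgboost version, newer ones return a matrix):

label <- as.numeric(dtrain[[11]])                       # classes must be integers 0..num_class-1
data  <- as.matrix(sapply(dtrain[, -11L], as.numeric))  # leakage fix: label column dropped

set.seed(10)
croosvalid <- xgb.cv(
  data = data,
  label = label,
  nfold = 10,
  nrounds = 10,
  prediction = TRUE,
  objective = "multi:softprob",   # the objective caret uses for multiclass
  num_class = max(label) + 1
)

# softprob returns per-class probabilities; recover hard class predictions
pred <- max.col(croosvalid$pred) - 1
mean(pred == label)               # honest cross-validated accuracy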

  • I modified dtrain[[33]] in the code; it was a copy-and-paste error because I also work with other datasets. Using softprob instead of softmax seems like a big difference to me, and the above code doesn't work anymore... But is it bad to use softmax instead of softprob? – Stefan Paul Popescu Jul 19 '16 at 07:41
  • @StefanPaulPopescu if you perform cross validation correctly, I can't see any problem with that. – catastrophic-failure Jul 19 '16 at 12:36
  • But is my cross-validation call correct? I mean the call from the code: croosvalid <- xgb.cv(...)? – Stefan Paul Popescu Jul 19 '16 at 17:45