I am using the xgboost package in R.
First, I want to tune the parameters on a validation set (20% of the data set). Second, I want to build a model and predict on a binary classification task with 5-fold cross-validation: in each of five iterations, I use 64% (80% × 80%) of the data as the training set and 16% (80% × 20%) as the test set.
First, I use xgb.cv to tune the parameters. Related questions are here and "xgboost in R: how does xgb.cv pass the optimal parameters into xgb.train".
set.seed(650)
tr.num <- sample(650, 130)  # I have 650 samples; 130 (20%) are used for tuning
data.tuning <- data[tr.num, ]
data.traintest <- data[-tr.num, ]
x.tune <- as.matrix(data.tuning[, 2:9])
k <- round(1 + log2(130))  # number of folds for the tuning CV
cv.nround <- 200  # upper bound on rounds to search
bst.cv <- xgb.cv(params = param, data = x.tune, label = data.tuning[, 10], nfold = k, nrounds = cv.nround, metrics = list("error"), prediction = TRUE)
......
[2] train-error:0.017573+0.008109 test-error:0.108456+0.104800
[3] train-error:0.013177+0.006646 test-error:0.100643+0.100299
[4] train-error:0.008782+0.004689 test-error:0.100643+0.100299
[5] train-error:0.003299+0.004553 test-error:0.100643+0.100299
[6] train-error:0.000000+0.000000 test-error:0.100643+0.100299
[7] train-error:0.000000+0.000000 test-error:0.108456+0.104800
[8] train-error:0.000000+0.000000 test-error:0.107996+0.086933
......
I selected nround = 7 because of the minimum test-error.
Second, I use xgb.cv again for 5-fold cross-validation in order to get the model and to find out the precision and recall. But how should I do this?
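A minimal sketch of picking the round with the lowest mean test error, using only the rows printed above (the elided rows are not included, so the result on the full log may differ). The data frame `eval_log` and its column names are assumptions that mimic the printed output; they are not read from the `bst.cv` object.

```r
# Mean test error per boosting round, copied from the rows printed above.
# `eval_log` is a hypothetical stand-in for the full xgb.cv log.
eval_log <- data.frame(
  iter = 2:8,
  test_error_mean = c(0.108456, 0.100643, 0.100643, 0.100643,
                      0.100643, 0.108456, 0.107996)
)

# which.min() returns the first row that attains the minimum,
# so ties are broken in favour of the smallest round.
best_row <- which.min(eval_log$test_error_mean)
best_nrounds <- eval_log$iter[best_row]
print(best_nrounds)
```

When several rounds tie, taking the first one keeps the model as small as possible; the mean and standard deviation columns can be inspected together before settling on a value.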
x.traintest <- as.matrix(data.traintest[, 2:9])
bst.cv <- xgb.cv(params = param, data = x.traintest, label = data.traintest[, 10], nrounds = 7, nfold = 5)
test <- 1:104  # 650 * 0.16 = 104
train <- 105:520
y.traintest <- as.matrix(data.traintest[, 10])
bst <- xgboost(params = param, data = x.traintest[train, ], label = y.traintest[train, ], nrounds = 7, nfold = 5)
pred <- predict(bst, x.traintest[test, ])
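# Convert probabilities to class labels in one vectorised step.
# Assigning strings element by element would coerce pred to character
# after the first iteration and break the later comparisons with 0.5.
pred <- ifelse(pred > 0.5, "case", "no")
table(y.traintest[test, ], pred)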
Is this 5-fold cross-validation and prediction? I want to get the average recall and precision over the 5 folds. How should I do this? I also don't understand how to use prediction = TRUE.
Related questions are here, here, and here.
Am I misunderstanding cross-validation or gradient boosting?
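One way to obtain per-fold and average precision and recall is to compute them from out-of-fold predictions. The sketch below uses base R only: `labels`, `oof_pred`, and `folds` are made-up inputs for illustration, and `precision_recall` is a hypothetical helper. With `prediction = TRUE`, `xgb.cv` returns the test-fold predictions, but the element names on the returned object differ between xgboost versions, so any mapping onto `bst.cv` is an assumption to check against the installed version.

```r
# Hypothetical helper: precision and recall for one fold.
# truth: 0/1 labels; prob: predicted probabilities of class 1.
# Precision is NaN if a fold has no predicted positives.
precision_recall <- function(truth, prob, threshold = 0.5) {
  pred <- as.integer(prob > threshold)
  tp <- sum(pred == 1 & truth == 1)
  fp <- sum(pred == 1 & truth == 0)
  fn <- sum(pred == 0 & truth == 1)
  c(precision = tp / (tp + fp), recall = tp / (tp + fn))
}

# Made-up data: 20 samples, 5 folds of 4 test indices each.
labels   <- rep(c(1, 1, 0, 0), times = 5)
oof_pred <- c(0.9, 0.8, 0.2, 0.1,
              0.9, 0.4, 0.6, 0.1,
              0.7, 0.6, 0.3, 0.2,
              0.8, 0.3, 0.2, 0.1,
              0.9, 0.7, 0.6, 0.4)
folds    <- split(1:20, rep(1:5, each = 4))

# One row per fold, then the average over the folds.
per_fold <- t(sapply(folds, function(idx) {
  precision_recall(labels[idx], oof_pred[idx])
}))
avg <- colMeans(per_fold)
print(per_fold)
print(avg)
```

Averaging per fold, rather than pooling all predictions into one confusion table, matches the "average over 5 folds" wording; the two approaches can give slightly different numbers.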