I need to do a classification task on this dataset. As the following code shows, I tried to implement xgboost using the caret package. Since my dataset is imbalanced, I prefer to use the F1 score as the performance measure. Furthermore, I need to use the first 700000 instances as the training set and the remaining 150000 instances as the test set. As the commented parts of my code show, I read this post and other related posts, but I could not resolve the issue.
mytrainvalid <- read.csv("mytrainvalid.csv")

library(xgboost)
library(dplyr)
library(caret)

## Recode the target as a two-level factor so caret can work with class labels
mytrainvalid$DEFAULT <- ifelse(mytrainvalid$DEFAULT != 0, "one", "zero")
mytrainvalid$DEFAULT <- as.factor(mytrainvalid$DEFAULT)

## Predictor matrix: everything except the target
input_x <- as.matrix(select(mytrainvalid, -DEFAULT))

## Use the validation index in the trainControl:
## rows 1:700000 for training, rows 700001:850000 held out for testing
ind <- as.integer(rownames(mytrainvalid))
vi  <- 700001:850000
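Just to spell out what I expect this index bookkeeping to give me, here is a quick sanity check on my side (it assumes the file really has 850000 rows):

length(ind)       # expected: 850000 rows in total
length(ind[-vi])  # expected: 700000 training rows
length(vi)        # expected: 150000 test rows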
# modelling
grid_default <- expand.grid(
  nrounds = c(100, 200),
  max_depth = 6,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
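As far as I understand (this is my assumption about how caret tunes, not something stated in the post I read), passing tuneGrid = grid_default would make caret evaluate exactly the two candidates defined above, whereas leaving it commented out and using search = "random" with tuneLength = 2 makes caret draw two random candidates instead, which is why the hyperparameter values in the warnings at the end differ from this grid.

nrow(grid_default)  # 2 candidate models (nrounds = 100 and nrounds = 200)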
## Use the F1 score because the data are imbalanced (roughly 20:1)
f1 <- function(data, lev = NULL, model = NULL) {
  ## "one" (DEFAULT != 0) is the positive class created above
  precision <- posPredValue(data$pred, data$obs, positive = "one")
  recall <- sensitivity(data$pred, data$obs, positive = "one")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
}
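As a sanity check that the summary function itself returns what I expect, it can be exercised on a tiny hand-made prediction frame (the toy data below are made up and assume "one" is the positive class):

toy <- data.frame(
  obs  = factor(c("one", "zero", "one",  "zero"), levels = c("one", "zero")),
  pred = factor(c("one", "zero", "zero", "zero"), levels = c("one", "zero"))
)
f1(toy)  # precision = 1, recall = 0.5, so F1 should be 2/3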
##
data.ctrl <- trainControl(method = "cv",
                          number = 1,
                          allowParallel = TRUE,
                          returnData = FALSE,
                          ## a single resample: fit on rows 1:700000, evaluate on rows 700001:850000
                          index = list(Fold1 = ind[-vi]),
                          sampling = "smote",
                          classProbs = TRUE,
                          summaryFunction = f1,
                          savePredictions = "final",
                          verboseIter = TRUE,
                          search = "random"
                          #savePred=T
                          )
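My understanding of the index argument (an assumption based on the trainControl documentation) is that caret fits on the rows listed in Fold1 and, because indexOut is not given, evaluates on their complement, i.e. exactly the held-out rows 700001:850000. A quick size check:

length(data.ctrl$index$Fold1)                # expected: 700000 rows used for fitting
length(setdiff(ind, data.ctrl$index$Fold1))  # expected: 150000 rows used for evaluation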
xgb_model <- caret::train(input_x,
                          mytrainvalid$DEFAULT,
                          method = "xgbTree",
                          trControl = data.ctrl,
                          #tuneGrid=grid_default,
                          verbose = FALSE,
                          metric = "F1",
                          classProbs = TRUE,
                          #linout=FALSE,
                          #threshold = 0.3,
                          #scale_pos_weight = sum(input_y$DEFAULT == "no")/sum(input_y$DEFAULT == "yes"),
                          #maximize = FALSE,
                          tuneLength = 2
                          )
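Once this call runs, my plan (a sketch on my side, relying on savePredictions = "final" keeping the predictions for the held-out fold) is to read off the winning hyperparameters and score the 150000 test rows from the stored predictions:

xgb_model$bestTune                      # best hyperparameter combination found
head(xgb_model$pred)                    # held-out predictions for rows 700001:850000
confusionMatrix(xgb_model$pred$pred,
                xgb_model$pred$obs,
                positive = "one",
                mode = "prec_recall")   # precision, recall and F1 on the test rows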
Unfortunately, the following error is produced:
Something is wrong; all the F1 metric values are missing:
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :2
Error: Stopping
In addition: Warning messages:
1: model fit failed for Fold1: eta=0.09121, max_depth=8, gamma=7.227, colsample_bytree=0.6533, min_child_weight=15, subsample=0.9783, nrounds=800 Error in createModel(x = subset_x(x, modelIndex), y = y[modelIndex], wts = wts[modelIndex], :
formal argument "classProbs" matched by multiple actual arguments
2: model fit failed for Fold1: eta=0.15119, max_depth=8, gamma=8.877, colsample_bytree=0.4655, min_child_weight= 3, subsample=0.9515, nrounds=536 Error in createModel(x = subset_x(x, modelIndex), y = y[modelIndex], wts = wts[modelIndex], :
formal argument "classProbs" matched by multiple actual arguments
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.