
I need to perform a classification task on this dataset. As the code below shows, I tried to implement xgboost using the caret package. Since my dataset is imbalanced, I prefer to use the F-score as the performance measure. Furthermore, I need to use the first 700,000 instances as the training set and the remaining 150,000 instances as the test set. As the commented parts of my code show, I read this post and other related posts, but I could not solve the issue.

mytrainvalid <- read.csv("mytrainvalid.csv")
library(xgboost)
library(dplyr)
library(caret)

mytrainvalid$DEFAULT <- ifelse(mytrainvalid$DEFAULT != 0,
                               "one",
                               "zero")
mytrainvalid$DEFAULT <- as.factor(mytrainvalid$DEFAULT)

input_x <- as.matrix(select(mytrainvalid, -DEFAULT))
## Use the validation index in the trainControl
ind <- as.integer(rownames(mytrainvalid))
vi <- 700001:850000

# modelling
grid_default <- expand.grid(
  nrounds = c(100,200),
  max_depth = 6,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
## use F-score since the data are imbalanced (roughly 20:1)
f1 <- function (data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "pass")
  recall  <- sensitivity(data$pred, data$obs, postive = "pass")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
} 
##
data.ctrl <- trainControl(method = "cv",
                          number = 1, 
                          allowParallel=TRUE,
                          returnData = FALSE,
                          index = list(Fold1=(1:ind)[-vi]),
                          sampling = "smote",
                          classProbs = TRUE,
                          summaryFunction = f1,
                          savePredictions = "final",
                          verboseIter=TRUE,
                          search = "random"
                          #savePred=T
)

xgb_model <- caret::train(input_x,
                          mytrainvalid$DEFAULT,
                          method="xgbTree",
                          trControl=data.ctrl,
                          #tuneGrid=grid_default,
                          verbose=FALSE,
                          metric = "F1",
                          classProbs=TRUE,
                          #linout=FALSE,
                          #threshold = 0.3,
                          #scale_pos_weight = sum(input_y$DEFAULT == "no")/sum(input_y$DEFAULT == "yes"),
                          #maximize = FALSE,
                          tuneLength = 2
)

Unfortunately, the following error is produced:

Something is wrong; all the F1 metric values are missing:
       F1     
 Min.   : NA  
 1st Qu.: NA  
 Median : NA  
 Mean   :NaN  
 3rd Qu.: NA  
 Max.   : NA  
 NA's   :2    
Error: Stopping
In addition: Warning messages:
1: model fit failed for Fold1: eta=0.09121, max_depth=8, gamma=7.227, colsample_bytree=0.6533, min_child_weight=15, subsample=0.9783, nrounds=800 Error in createModel(x = subset_x(x, modelIndex), y = y[modelIndex], wts = wts[modelIndex],  : 
  formal argument "classProbs" matched by multiple actual arguments
 
2: model fit failed for Fold1: eta=0.15119, max_depth=8, gamma=8.877, colsample_bytree=0.4655, min_child_weight= 3, subsample=0.9515, nrounds=536 Error in createModel(x = subset_x(x, modelIndex), y = y[modelIndex], wts = wts[modelIndex],  : 
  formal argument "classProbs" matched by multiple actual arguments
 
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.
ebrahimi
  • remove `classProbs=TRUE` from `caret::train`. Does it solve the issue? – missuse Dec 16 '22 at 05:51
  • @missuse Thanks a lot. Unfortunately, the same error is still reported. – ebrahimi Dec 16 '22 at 06:04
  • Did you try running just xgboost without caret on the dataset? Also try changing the metric to accuracy in caret, removing classProbs, and see if it helps. That would help diagnose the issue. – missuse Dec 16 '22 at 06:29
  • @missuse I tried it without caret and the error was not reported, but I cannot understand what the problem with this code is. I also tried accuracy in caret, and the same error was reported. Thanks. – ebrahimi Dec 16 '22 at 07:25
  • As far as I am aware, caret hasn't been updated for quite some time, so some breaking updates to xgboost could potentially have caused this. You can verify with a simpler dataset and train call (no custom metrics and such). Anyway, I suggest switching to a more modern framework: tidymodels (caret's successor) or mlr3 (I prefer this one; although it might be harder to get into at first, it offers more control/functionality in my opinion). – missuse Dec 16 '22 at 08:39
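For reference, a sketch that combines the suggestions from the comments (untested on the full dataset): it matches `positive` to the factor level `"one"` created above (the posted `f1` uses `"pass"`, which is not a level of `DEFAULT`, plus a `postive` typo), drops the duplicated `classProbs` argument from `train()`, and replaces `(1:ind)[-vi]` with `ind[-vi]` (since `1:ind` only uses the first element of `ind`, the original index keeps a single row):

```r
library(caret)

## F1 with the positive class set to the actual factor level "one"
f1 <- function(data, lev = NULL, model = NULL) {
  precision <- posPredValue(data$pred, data$obs, positive = "one")
  recall    <- sensitivity(data$pred, data$obs, positive = "one")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- "F1"
  f1_val
}

data.ctrl <- trainControl(method = "cv",
                          index = list(Fold1 = ind[-vi]),  # rows 1:700000 as the single training fold
                          classProbs = TRUE,               # set here only, not in train()
                          summaryFunction = f1,
                          sampling = "smote",
                          savePredictions = "final")

xgb_model <- caret::train(input_x,
                          mytrainvalid$DEFAULT,
                          method = "xgbTree",
                          trControl = data.ctrl,
                          metric = "F1",
                          tuneLength = 2)
```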
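And a minimal sketch of the tidymodels route missuse mentions, assuming the same chronological 700,000/150,000 split (functions come from rsample/parsnip/yardstick; the hyperparameters are illustrative, mirroring `grid_default` above):

```r
library(tidymodels)

## chronological split: first 700,000 rows train, last 150,000 test
split    <- initial_time_split(mytrainvalid, prop = 700000 / 850000)
train_df <- training(split)
test_df  <- testing(split)

## xgboost specification
xgb_spec <- boost_tree(trees = 200, tree_depth = 6, learn_rate = 0.1) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_fit <- fit(xgb_spec, DEFAULT ~ ., data = train_df)

## F1 on the held-out rows; "one" is the first factor level, so it is
## treated as the event by default
preds <- predict(xgb_fit, test_df) %>% bind_cols(test_df["DEFAULT"])
f_meas(preds, truth = DEFAULT, estimate = .pred_class)
```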

0 Answers