
I'm running an XGBoost binary classification model with 375 training observations, 125 testing observations, and 19 features. Below are my arguments:

Boosted Tree Model Specification (classification)

Main Arguments:
  mtry = 13
  trees = 100
  min_n = 3
  tree_depth = 5
  learn_rate = 1.57515292756891e-09
  loss_reduction = 0.801337205143451
  sample_size = 0.967102140800562

Computational engine: xgboost 
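
For context, here is a minimal sketch of how a specification like the one printed above is typically built with parsnip; the formula and the `train_data` object are placeholders, not my actual (unshared) data:

```r
library(tidymodels)

# Sketch only: `outcome` and `train_data` are placeholders for the real
# outcome column and training set
xgb_spec <- boost_tree(
  mtry = 13,
  trees = 100,
  min_n = 3,
  tree_depth = 5,
  learn_rate = 1.57515292756891e-09,
  loss_reduction = 0.801337205143451,
  sample_size = 0.967102140800562
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_fit <- fit(xgb_spec, outcome ~ ., data = train_data)
```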

The model appears to perform well based on the confusion matrix:

[Confusion matrix]

But there is no spread in the predicted class probabilities; every prediction sits at roughly .50001 vs .49999:

[ROC curve]
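
A rough sketch of how I'm looking at the predictions; `xgb_res` stands in for the result of `last_fit()`/`fit_resamples()`, and `.pred_Yes` / `Truth` are placeholders for my actual probability and truth columns:

```r
# Inspect the spread of the predicted probabilities and the ROC curve
preds <- collect_predictions(xgb_res)

summary(preds$.pred_Yes)   # every value sits right around 0.5

preds %>%
  roc_curve(Truth, .pred_Yes) %>%
  autoplot()
```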

I'm new to using XGBoost. Is this an overfitting issue, a sample size issue, or am I misspecifying the arguments? I feel like there is an obvious issue that I would love to be educated about.

Using R, tidymodels

Curtis
  • How did you generate the ROC curve plot? Using Yardstick e.g. https://yardstick.tidymodels.org/reference/roc_curve.html ? – jared_mamrot Jan 27 '21 at 03:12
  • Yes - `collect_predictions() %>% roc_curve(., Truth, .pred_class)` – Curtis Jan 27 '21 at 16:52
  • Hmm...Sorry @Curtis, not sure what's going on. If the confusion matrix above is based on training data then it indicates overfitting (extreme overfitting) where the model is unable to find any difference between groups in the test data, but that seems unlikely given your parameters. Are you able to share your data? – jared_mamrot Jan 27 '21 at 22:46
  • @jared_mamrot unfortunately I'm not but I appreciate your thoughts. I'll re-examine the features as many of them are zero-inflated with low variation. I'm not sure if that would impact this issue but regardless it is AN issue that I'll need to address. – Curtis Jan 28 '21 at 00:32
  • If you can replicate the issue using a publicly-available dataset (e.g. `install.packages("titanic"); library(titanic); data("Titanic"); training_data <- titanic_train`) you could repost the question and see what others say (see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – jared_mamrot Jan 28 '21 at 00:44
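
Along the lines of the last comment, a minimal reproducible sketch using the titanic package; the chosen predictors are illustrative only, and the near-zero learn_rate mirrors the specification above:

```r
library(tidymodels)
library(titanic)

# Illustrative predictors from titanic_train, not the original 19 features
dat <- titanic_train %>%
  select(Survived, Pclass, Sex, Age, Fare) %>%
  mutate(Survived = factor(Survived)) %>%
  drop_na()

split <- initial_split(dat, prop = 0.75, strata = Survived)

xgb_spec <- boost_tree(trees = 100, learn_rate = 1.57515292756891e-09) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_fit <- fit(xgb_spec, Survived ~ ., data = training(split))

# Predicted class probabilities on the held-out data
predict(xgb_fit, testing(split), type = "prob") %>%
  summary()
```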

0 Answers