
I have trained an XGBoost model on labelled customer payment data, with the aim of predicting future payment behavior. The outcome, Class, is a two-level factor. I then predict class probabilities on the test set:

XGB.prediction <- predict(object = XGB,
                          newdata = df.test,
                          type = "prob")

df.test is a data frame with 2534 obs. of 43 variables.

I therefore expect XGB.prediction to have 2534 obs. of 2 variables, one probability column per class. However, it has only 1416 obs.

I have tried to determine whether NA values could have caused this:

> anyNA(df.test$Class)
   [1] FALSE
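
That check covers only the response column. A minimal sketch of extending it to every column, using a made-up toy data frame (`toy.test` and its columns are hypothetical stand-ins, not the real `df.test`):

```r
# Hypothetical stand-in for df.test: 6 rows, NAs only in predictor columns
toy.test <- data.frame(
  Class  = factor(c("payer", "non_payer", "payer", "payer", "non_payer", "payer")),
  amount = c(10, NA, 30, 40, NA, 60),
  delay  = c(1, 2, NA, 4, 5, 6)
)

# The response has no missing values, mirroring the check above
anyNA(toy.test$Class)

# The predictors may still have them: count NAs per column
na.per.column <- colSums(is.na(toy.test))
na.per.column

# Number of rows with no NA in any column
n.complete <- sum(complete.cases(toy.test))
n.complete
```

If `sum(complete.cases(df.test))` turned out to be 1416, missing predictor values would account for the shortfall.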

The mismatch causes an error when I try to evaluate the model with a ROC curve:

> xgb.roc <- roc(response = df.test$Class,
               auc = TRUE,
               plot = TRUE,
               predictor = XGB.prediction[,"payer"])
 Error in roc.default(response = df.test$Class, auc = TRUE, plot = TRUE,  : 
      Response and predictor must be vectors of the same length.
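
The length mismatch can be reproduced without caret or xgboost. The sketch below uses base R's `glm` on simulated data (all names such as `toy.train`, `toy.test`, `x1` and `x2` are made up) to show how `na.action = na.omit` at prediction time silently drops rows with missing predictors, and how the response can be subset to match. It is an assumption, not something the question confirms, that the same mechanism is at work here; as far as I know, caret's `predict.train` takes an `na.action` argument that defaults to `na.omit`.

```r
set.seed(1)

# Simulated two-class data; not the real payment data
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Class <- factor(ifelse(toy$x1 + toy$x2 + rnorm(n) > 0, "payer", "non_payer"))

toy.train <- toy[1:150, ]
toy.test  <- toy[151:200, ]
toy.test$x1[c(3, 10, 25)] <- NA   # inject missing predictor values

fit <- glm(Class ~ x1 + x2, data = toy.train, family = binomial)

# na.omit drops the 3 incomplete rows, so the result is shorter than toy.test
pred.omit <- predict(fit, newdata = toy.test, type = "response",
                     na.action = na.omit)

# na.pass (the default for predict.glm) keeps every row and
# returns NA for the incomplete ones
pred.pass <- predict(fit, newdata = toy.test, type = "response",
                     na.action = na.pass)

# Subset the response to the rows that were actually predicted
keep <- complete.cases(toy.test[, c("x1", "x2")])
response.aligned <- toy.test$Class[keep]

# Test rows, predictions with na.omit, predictions with na.pass, aligned response
c(nrow(toy.test), length(pred.omit), length(pred.pass), length(response.aligned))
```

With three predictor values set to NA, the four lengths should come out as 50, 47, 50 and 47. For the model in the question, two options follow from this: pass `na.action = na.pass` to `predict` so that all 2534 rows are kept (this should only help if the underlying booster tolerates missing values), or subset `df.test$Class` to the complete rows before calling `roc`.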

The model was trained as follows:

XGB <- train(
  Class ~ .,
  data = df.train,
  trControl = ctrl,
  method = "xgbTree",
  tuneGrid = grid.xgboost,
  importance = 'impurity',
  metric = "ROC")

  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jul 30 '18 at 14:20
  • `XGB.prediction[,"payer"]` needs to be a numeric vector with "zeros" or "ones", the labels assigned by the predict output. Can you post the output from `str(df.test)` and show us what's inside `XGB.prediction`? – RLave Jul 30 '18 at 14:21
  • `anyNA(df.test$Class) # FALSE` shows that there are no missing values of your `Class` variable, but what about the other 2533 variables? – Gregor Thomas Jul 30 '18 at 14:25
  • And with 2 classes, you should expect a vector result from `predict`, with length equal to the number of rows of your `newdata` without missing values. Not a two-column data frame, just a 1-dimensional vector with the probability of the 2nd factor level (the first factor level is the baseline, and its probability is `1 - (probability of second level)`). – Gregor Thomas Jul 30 '18 at 14:29
  • @RLave df.test holds 2534 obs. of 43 variables; XGB.prediction holds 1416 obs. of 2 variables. The variables are "payer" and "non payer". I hope that is specific enough. – Mörk Choklad Jul 30 '18 at 14:55
  • @Gregor If I am interpreting what you are saying correctly, I must redefine my current Classes to a boolean and then re-run? The class column is already a factor with two levels so I did not think this would have an effect. – Mörk Choklad Jul 30 '18 at 14:59
  • Maybe it would clear things up some if you showed how you trained your model? What objective and other parameters were used? – Gregor Thomas Jul 30 '18 at 17:33
  • Most likely either df.test$Class or XGB.prediction[,"payer"] are NULL. Can you print those? – Calimo Jul 30 '18 at 20:34
  • @Calimo I have checked both for NULL; `is.null` returned FALSE for both. – Mörk Choklad Jul 31 '18 at 07:29
  • @MörkChoklad what is happening is anyone's guess. If you had provided a reproducible example as asked by MrFlick in the first comment you would already have an answer. – Calimo Jul 31 '18 at 07:35
