Missing entries following prediction on new data

Question

I have trained an XGBoost model on labelled customer payment data to predict future payment behavior. The response, Class, is a two-level factor.

XGB.prediction <- predict(object = XGB,
                          newdata = df.test,
                          type = "prob")

df.test is a data frame with 2534 obs. of 43 variables.

I therefore expect XGB.prediction to hold 2534 obs. of 2 variables (the probabilities of the two classes). However, it holds only 1416 obs.

I have tried to determine whether NA values could have caused this:

> anyNA(df.test$Class)
   [1] FALSE

The mismatch causes an error when I try to evaluate the model with a ROC curve:

> xgb.roc <- roc(response = df.test$Class,
                 auc = TRUE,
                 plot = TRUE,
                 predictor = XGB.prediction[,"payer"])

Error in roc.default(response = df.test$Class, auc = TRUE, plot = TRUE,  : 
  Response and predictor must be vectors of the same length.

The model was trained as follows:

XGB <- train(Class ~ .,
             data = df.train,
             trControl = ctrl,
             method = "xgbTree",
             tuneGrid = grid.xgboost,
             importance = 'impurity',
             metric = "ROC")

When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Jul 30 '18 at 14:20
`XGB.prediction[,"payer"]` needs to be a numeric vector with "zeros" or "ones", the labels assigned by the predict output. Can you post the output from `str(df.test)` and show us what's inside `XGB.prediction`? — RLave, Jul 30 '18 at 14:21
`anyNA(df.test$Class) # FALSE` shows that there are no missing values in your `Class` variable, but what about the other 42 variables? — Gregor Thomas, Jul 30 '18 at 14:25
And with 2 classes, you should expect a vector result from `predict`, with length equal to the number of rows of your `newdata` without missing values. Not a two-column data frame, just a 1-dimensional vector with the probability of the 2nd factor level (the first factor level is the baseline, and its probability is `1 - (probability of second level)`). — Gregor Thomas, Jul 30 '18 at 14:29
@RLave `df.test` holds 2534 obs. of 43 variables; `XGB.prediction` holds 1416 obs. of 2 variables. The variables are "payer" and "non payer". I hope that is specific enough. — Mörk Choklad, Jul 30 '18 at 14:55
@Gregor If I am interpreting what you are saying correctly, I must redefine my current Classes to a boolean and then re-run? The class column is already a factor with two levels so I did not think this would have an effect. — Mörk Choklad, Jul 30 '18 at 14:59
Maybe it would clear things up some if you showed how you trained your model? What objective and other parameters were used? — Gregor Thomas, Jul 30 '18 at 17:33
Most likely either `df.test$Class` or `XGB.prediction[,"payer"]` is NULL. Can you print those? — Calimo, Jul 30 '18 at 20:34
@Calimo I have checked both for NULL; `is.null` returned FALSE for each. — Mörk Choklad, Jul 31 '18 at 07:29
@MörkChoklad what is happening is anyone's guess. If you had provided a reproducible example as asked by MrFlick in the first comment you would already have an answer. — Calimo, Jul 31 '18 at 07:35
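
The comments above point to missing values in the predictor columns, rather than in `Class`, as a possible cause of the dropped rows. A minimal sketch of that check follows. It uses a small made-up data frame; `df.toy` and its columns `amount` and `days` are hypothetical stand-ins for `df.test`, not the asker's data.

```r
# Hypothetical stand-in for df.test: 5 rows, 2 of which have an NA predictor
df.toy <- data.frame(
  Class  = factor(c("payer", "non_payer", "payer", "non_payer", "payer")),
  amount = c(10, NA, 30, 40, 50),
  days   = c(1, 2, NA, 4, 5)
)

# Number of missing values in each column, not only in Class
na.per.column <- colSums(is.na(df.toy))

# Logical index of rows that have no NA in any column
keep <- complete.cases(df.toy)
n.complete <- sum(keep)

# Response restricted to the rows that survive NA removal, so that it
# has the same length as predictions made on complete rows only
response.aligned <- df.toy$Class[keep]

print(na.per.column)
print(n.complete)
```

If `sum(complete.cases(df.test))` comes out as 1416, rows with missing predictors are probably being dropped at prediction time. Assuming `XGB` is a caret `train` object, `predict()` omits such rows by default (`na.action = na.omit`). Passing `na.action = na.pass` may keep all 2534 rows. Alternatively, the response can be subset with the same `keep` index before calling `roc()`.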
