
I am trying to use XGBoost for binary classification and, as a newbie, I have run into a problem.

First, I trained a model, “fit”:

fit <- xgboost(
    data = dtrain #as.matrix(dat[,predictors])
    , label = label 
    #, eta = 0.1                        # step size shrinkage 
    #, max_depth = 25                   # maximum depth of tree 
    , nround=100
    #, subsample = 0.5
    #, colsample_bytree = 0.5           # part of data instances to grow tree
    #, seed = 1
    , eval_metric = "merror"        # or "mlogloss" - evaluation metric 
    , objective = "binary:logistic" # we will train a binary classification model using logistic regression; other options: "multi:softprob", "multi:softmax" = multiclass classification
    , num_class = 2                 # Number of classes in the dependent variable.
    #, nthread = 3                  # number of threads to be used 
    #, silent = 1
    #, prediction=T
)

Then I try to use that model to predict the labels for a new test data.frame:

predictions = predict(fit, as.matrix(test))
print(str(predictions))

As a result, I get twice as many probability values as there are rows in my test data.frame:

num [1:62210] 0.0567 0.0455 0.023 0.0565 0.0642 ...

I read that since I am doing binary classification, for each row in the test data.frame I get 2 probabilities: one for label1 and one for label2. But how do I join that predicted vector (and what is the type of that predicted object, “predictions”?) with my data.frame “test” and keep the prediction with the highest probability? I tried to rbind “predictions” and “test”, but got 62k rows in the merged data.frame (instead of the 31k in the initial “test”). Please show me how to get one prediction for each row.

And the second question: since I get 2 probabilities (for label1 and label2) for each row in the “test” data.frame, I expected the sum of these 2 values to be 1. But for one test row I get 2 small values: 0.0455073267221451 and 0.0621210783720016. Their sum is much less than 1... Why is that?

Please explain these 2 things to me. I searched but did not find any relevant topic with a clear explanation...

Anton Arkhipkin

1 Answer


First, the cause of the doubled output: num_class and eval_metric = "merror" are parameters for multiclass objectives. With objective = "binary:logistic" you should remove num_class = 2 (and use a binary metric such as "error" or "logloss"); the model then returns a single probability per row, P(label = 1), and P(label = 0) is just 1 minus that value. This also answers your second question: the two values per row you are seeing are not a complementary (p, 1 - p) pair, which is why they do not sum to 1.

You then need to create the test set: a matrix with the same p columns used for training, without the "outcome" variable (the y of the model).

Keep the labels of the test set (the truth) as a numeric vector.

Then it's just a couple of instructions. I suggest caret for the confusionMatrix function.

library(caret)
library(xgboost)

test_matrix <- data.matrix(test[, setdiff(names(test), "outcome")]) # your test matrix (without the labels)
test_labels <- as.numeric(test$outcome) # the test labels
xgb_pred <- predict(fit, test_matrix) # this will give you just one probability (it will be a simple vector)
xgb_pred_class <- as.numeric(xgb_pred > 0.50) # to get your predicted labels 
# keep in mind that 0.50 is a threshold that can be modified.

confusionMatrix(as.factor(xgb_pred_class), as.factor(test_labels))
# this will get your confusion Matrix
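Once predict() returns one probability per row (i.e. with num_class = 2 removed from the training call), the predictions line up with the rows of test and can be attached column-wise. A minimal sketch with made-up numbers standing in for test, xgb_pred, and xgb_pred_class:

```r
# Stand-ins for the objects above (made-up numbers, 3 rows for illustration):
test           <- data.frame(feature1 = c(1.2, 3.4, 5.6))
xgb_pred       <- c(0.20, 0.60, 0.90)      # one P(label = 1) per row
xgb_pred_class <- as.numeric(xgb_pred > 0.50)

# cbind adds columns (same number of rows); rbind would stack rows,
# which is what produced the 62k-row frame in the question.
results <- cbind(
  test,
  prob_label1 = xgb_pred,       # P(label = 1) from binary:logistic
  prob_label0 = 1 - xgb_pred,   # the two class probabilities sum to 1
  pred_class  = xgb_pred_class  # 0/1 label at the 0.50 threshold
)
results
```

With the real objects, only the cbind() call is needed; the stand-in block at the top is just to make the sketch self-contained.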
RLave
  • Still have the same problem with the doubled size of the prediction: print(str(test_labels)) returns num [1:31105] 1 0 0 0 0 0 0 0 0 0 ...; xgb_pred <- predict(fit, test_matrix); print(str(xgb_pred)) returns num [1:62210] 0.2264 0.3579 0.0708 0.1311 0.0424 ...; xgb_pred_class <- as.numeric(xgb_pred > 0.50); print(str(xgb_pred_class)) returns num [1:62210] 0 0 0 0 0 0 0 1 0 0 ...; and confusionMatrix(as.factor(xgb_pred_class), as.factor(test_labels)) gives the error "all arguments must have the same length" – Anton Arkhipkin Jul 31 '18 at 08:28
  • one more question about the technical side of R. If I am still using objective = "binary:logistic" or "multi:softprob" and getting "num_class" * size(test) of prediction rows in one list - how technically I can rbind those few values (2 or more) to each line in "test" data.frame? Could you help me with that, please? – Anton Arkhipkin Jul 31 '18 at 11:12
  • it's hard to say without an example, maybe it's best to ask a question following this https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example, this will help people understand your problem. – RLave Jul 31 '18 at 12:13
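Regarding the multiclass follow-up in the comments: with objective = "multi:softprob", predict() returns a flat vector of length num_class * nrow(test), with the probabilities for each row stored contiguously. That vector can be reshaped into a matrix and attached column-wise with cbind (not rbind). A minimal sketch with made-up numbers, assuming 2 classes and 3 test rows:

```r
num_class <- 2

# e.g. softprob_pred <- predict(fit, test_matrix) for a multi:softprob model;
# here a made-up vector: the num_class probabilities for each row are contiguous.
softprob_pred <- c(0.9, 0.1,   # row 1
                   0.3, 0.7,   # row 2
                   0.6, 0.4)   # row 3

# One row per test observation, one column per class:
prob_matrix <- matrix(softprob_pred, ncol = num_class, byrow = TRUE)
colnames(prob_matrix) <- paste0("prob_class", seq_len(num_class) - 1)

# Class with the highest probability (0-based, to match xgboost labels):
pred_class <- max.col(prob_matrix) - 1   # here: 0 1 0
```

cbind(test, prob_matrix, pred_class) then lines the per-class probabilities and the winning class up with the rows of test.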