
I'm trying to understand how XGBoost works for a multiclass problem. I used the iris dataset to predict which species an observation belongs to based on its characteristics, and computed the results in R.

The code is below:

library(xgboost)

# Recode Species as integer class labels 0/1/2 (iris has exactly three species)
test <- as.data.frame(iris)
test$y <- ifelse(test$Species == "setosa", 0,
                 ifelse(test$Species == "versicolor", 1, 2))

x_iris <- test[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
y_iris <- test[, "y"]

# A single boosting round (nround = 1), so each class gets exactly one tree
iris_model <- xgboost(data = data.matrix(x_iris), label = y_iris, eta = 0.1, base_score = 0.5,
                      nround = 1, subsample = 1, colsample_bytree = 1, num_class = 3,
                      max_depth = 4, lambda = 0, eval_metric = "mlogloss",
                      objective = "multi:softprob")

xgb.plot.tree(model = iris_model, feature_names = colnames(x_iris))

I tried to compute the results manually and compare the gain and cover values with the R output, and I noticed a couple of things:

  1. The initial probability is always 1/(number of classes), irrespective of what we provide in the 'base_score' parameter in R. The 'base_score' instead gets added at the end, to the final log-odds value, and this matches the R output when we run the predict function to get the log-odds. In binary classification, by contrast, the 'base_score' parameter is used as the model's initial probability.

     predict(iris_model, data.matrix(x_iris), reshape = TRUE, outputmargin = TRUE)
  2. The second-order gradient (hessian) of the loss is (2.0f * p * (1.0f - p) * wt) for multiclass problems and (p * (1.0f - p) * wt) for binary problems.
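The behaviour in point 1 follows from the softmax transform itself: adding the same constant to every class's raw score leaves the probabilities unchanged, so a scalar base_score can only show up in the raw margins, never in the predicted probabilities. A minimal Python sketch (illustrative only, not using xgboost) of both facts:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

num_class = 3
base_score = 0.5

# Before any trees are built, every class has the same raw score,
# so the softmax of the initial margins is uniform: 1 / num_class.
initial = softmax([base_score] * num_class)
print(initial)  # each entry is 1/3, regardless of base_score

# Softmax is shift-invariant: adding the same constant to every class
# margin leaves the probabilities unchanged. This is why base_score
# appears as an offset in the margins (outputmargin = TRUE) but never
# changes the predicted class probabilities.
margins = [0.4, -1.2, 0.8]
shifted = [m + base_score for m in margins]
assert all(abs(a - b) < 1e-12
           for a, b in zip(softmax(margins), softmax(shifted)))
```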

There is an explanation of the loss function in the GitHub repo (https://github.com/dmlc/xgboost/issues/638), but no information on why the base_score gets added at the end.
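On point 2, the quoted expressions are second-order gradients of the logloss, not the loss itself. A short Python sketch (illustrative only) checking by finite differences that the analytic second derivative of the multiclass cross-entropy with respect to a class score is p * (1 - p); the extra 2.0f factor in the multiclass expression is the scaling discussed in the linked issue #638, not the analytic hessian:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nll(scores, label):
    # Multiclass cross-entropy loss: -log softmax(scores)[label]
    return -math.log(softmax(scores)[label])

scores = [0.3, -0.5, 1.1]
label = 2
eps = 1e-4

for k in range(3):
    p = softmax(scores)[k]
    up = scores[:]; up[k] += eps
    dn = scores[:]; dn[k] -= eps
    # First derivative w.r.t. score k is p_k - y_k
    grad_fd = (nll(up, label) - nll(dn, label)) / (2 * eps)
    y = 1.0 if k == label else 0.0
    assert abs(grad_fd - (p - y)) < 1e-6
    # Second derivative w.r.t. score k is p_k * (1 - p_k)
    hess_fd = (nll(up, label) - 2 * nll(scores, label) + nll(dn, label)) / eps**2
    assert abs(hess_fd - p * (1 - p)) < 1e-3
```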

Is this simply how the R implementation was designed, or does the XGBoost multiclass algorithm itself work this way?

akshay
    Hi @akshay, I wrote a response to a similar question earlier today that might be of use to you: https://stackoverflow.com/questions/62350750/what-is-the-use-of-base-score-in-xgboost-multiclass-working/62379590#62379590 – jared_mamrot Jun 15 '20 at 05:56
  • Thanks @jared_mamrot. I understand that altering 'base_score' affects the number of trees, but in the multiclass case I found that the initial probability is always equal to 1/n no matter what value we specify as the 'base_score' parameter in R. When I try to manually calculate the steps using the same parameters specified in the code and compare with the R output, the difference in the final log-odds is always the 'base_score' value we specified. My goal is to understand how the algorithm works by comparing with R and trying to match the results – akshay Jun 15 '20 at 08:27

0 Answers