
I'm working with an unbalanced classification problem, in which the target variable contains:

np.bincount(y_train)
array([151953,  13273])

i.e. 151953 zeroes and 13273 ones.

To deal with this I'm using XGBoost's weight parameter when defining the DMatrix:

dtrain = xgb.DMatrix(data=X_train,
                     label=y_train,
                     weight=w)    # w holds the per-sample weights computed below

For the weights I've been using:

bc = np.bincount(y_train)                  # class counts: [151953, 13273]
n_samples = bc.sum()
n_classes = len(bc)
weights = n_samples / (n_classes * bc)     # "balanced" per-class weights
w = weights[y_train.values]                # per-sample weights, one entry per training row

Where weights is array([0.54367469, 6.22413923]), and with the last line of code I'm just indexing it using the binary values in y_train. This seems like the correct approach to define the weights, since it reflects the actual ratio between the number of samples of one class and the other. However, this seems to be favoring the minority class, which can be seen by inspecting the confusion matrix:

array([[18881, 19195],
       [  657,  2574]])

So just by trying out different weight values, I've realized that with a fairly close weight ratio, specifically array([1, 7]), the results seem much more reasonable:

array([[23020, 15056],
       [  837,  2394]])
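
For context, these confusion matrices presumably come from evaluating on a held-out set. A minimal sketch of that evaluation, assuming an X_test/y_test split (not shown in the post) and the default 0.5 decision threshold:

import xgboost as xgb
from sklearn.metrics import confusion_matrix

# Sketch only: train on the weighted DMatrix and threshold probabilities at 0.5
dtrain = xgb.DMatrix(data=X_train, label=y_train, weight=w)
dtest = xgb.DMatrix(data=X_test, label=y_test)

params = {'objective': 'binary:logistic', 'eval_metric': 'auc'}
bst = xgb.train(params, dtrain, num_boost_round=100)   # rounds chosen arbitrarily

probs = bst.predict(dtest)                 # predicted probability of class 1
preds = (probs > 0.5).astype(int)          # default threshold
print(confusion_matrix(y_test, preds))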

So my question is:

  • Why does using the actual class weights yield poor metrics?
  • What is the right way to set the weights for an imbalanced problem?
yatu

2 Answers


Internally, xgboost uses the input weights to boost the contribution of the samples from the minority class to the loss function, by multiplying the calculated gradients and Hessians by the weights [ref].
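
Roughly, for the binary:logistic objective, a sample's weight just scales that sample's gradient and Hessian. A minimal sketch of that idea, written as a custom objective purely for illustration (this is not xgboost's actual internal code):

import numpy as np

# Illustrative only: weighted gradient/hessian of the logistic loss,
# mirroring how each sample's contribution is scaled by its weight.
def weighted_logistic_obj(preds, dtrain):
    y = dtrain.get_label()
    w = dtrain.get_weight()            # the array passed as weight= to the DMatrix
    p = 1.0 / (1.0 + np.exp(-preds))   # raw margin -> probability
    grad = (p - y) * w                 # the weight scales the gradient...
    hess = p * (1.0 - p) * w           # ...and the hessian
    return grad, hess

# e.g. bst = xgb.train(params, dtrain, num_boost_round=100, obj=weighted_logistic_obj)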

While promising and popular, there is no guarantee that the method you have mentioned will yield the best performance (it also depends on how the other hyper-parameters are set, on the data distribution, and on the metric used); it is just a heuristic. You may also want to use ROC-AUC for evaluation (as recommended by xgboost). As with most other hyper-parameters, a more systematic way of optimizing the weights is a grid search. Here is an implementation.
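
For example, a hedged sketch of such a grid search using the sklearn wrapper and ROC-AUC scoring (the candidate values and CV settings below are illustrative, not taken from the linked implementation):

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

ratio = 151953 / 13273                       # negative/positive ratio from the question, ~11.4

param_grid = {'scale_pos_weight': [1, ratio / 4, ratio / 2, ratio]}
clf = xgb.XGBClassifier(objective='binary:logistic', n_estimators=100)

search = GridSearchCV(clf, param_grid, scoring='roc_auc', cv=5)
search.fit(X_train, y_train)                 # X_train/y_train as defined in the question
print(search.best_params_, search.best_score_)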

Reveille
  • Sorry, I overlooked this answer since I slightly misunderstood it at first. But rereading it, I've realized I actually ended up implementing something very close to what is proposed in the link. Thank you! – yatu Apr 27 '20 at 14:07
  • Glad it helped. Please feel free to edit it for clarity. I'll also try to do so. – Reveille Apr 28 '20 at 14:54
  • Ok, will try complement it with what I ended up doing. Btw, have you managed to find a good working solution to [this](https://stackoverflow.com/questions/61022427/find-nearest-transition-in-n-dimensional-array)? – yatu Apr 28 '20 at 14:57
  • Nope, really. I was thinking of putting a bounty on it as it would be a universal method of estimating distance to boundary but then it got downvoted and flagged for closure. Perhaps need to break it down to multiple questions. – Reveille Apr 28 '20 at 15:10
  • I've been giving it some thought... Not sure how to tackle it, but I'd like to see different approaches. I think it'd be interesting if you explained the actual problem better, or why you want this; I didn't really understand. Will give it a look when I have time, seems like a tough one – yatu Apr 28 '20 at 15:12

It seems you are using a binary classification model. For binary problems, XGBoost has a hyperparameter called scale_pos_weight, which balances the ratio between the positive and negative classes. As per the documentation, its value is calculated with the formula:

scale_pos_weight = sum(negative instances) / sum(positive instances)
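
A minimal sketch of computing that value for the counts in the question and passing it to the booster (the training call itself is just an example, not part of the original post):

import numpy as np
import xgboost as xgb

bc = np.bincount(y_train)                    # [151953, 13273]
spw = bc[0] / bc[1]                          # sum(negative) / sum(positive) ~= 11.45

params = {'objective': 'binary:logistic', 'scale_pos_weight': spw}
bst = xgb.train(params, xgb.DMatrix(X_train, label=y_train), num_boost_round=100)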

This parameter can be tuned as well, so you can use methods like GridSearchCV to find the best value.

Vatsal Gupta