
I'm new to ML and I've been working with an imbalanced dataset where the count of negative samples is twice that of the positive samples. To address this, I set scikit-learn's RandomForestClassifier to class_weight = 'balanced', which gave me an ROC-AUC score of 0.904 and a recall of 0.86 for class 1. When I then tried to improve the AUC score further by assigning explicit weights, e.g. class_weight = {0: 0.5, 1: 2.75}, there wasn't any major difference in the results. I assumed this would penalize every misclassification of class 1 more heavily, but it didn't seem to work as expected.
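
For context, the 'balanced' baseline that produced those numbers was set up roughly like this (a minimal sketch; X_train, y_train, X_test, y_test are my train/test splits):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# 'balanced' derives class weights inversely proportional to class frequencies
baseline = RandomForestClassifier(random_state=42, class_weight='balanced')
baseline.fit(X_train, y_train)
print(roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))  # 0.904 in my case

The explicitly weighted version: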

randomForestClf = RandomForestClassifier(random_state=42, class_weight={0: 0.5, 1: 2.75})

I tried different values, but none had a major impact: the recall for class 1 stays the same or drops slightly (0.85), and the change in the AUC value is insignificant (0.90122). The weights only seem to matter when one of the labels is set to 0. I also tried setting the sample weights instead, but that didn't seem to work either.

# Sample weights: weight each training row by its class
import numpy as np

class_weights = [0.5, 2]  # index 0 -> weight for class 0, index 1 -> weight for class 1
weights = np.ones(y_train.shape[0], dtype='float')
for i, val in enumerate(y_train):
    weights[i] = class_weights[val]
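
The weights were then passed to fit (scikit-learn's sample_weight parameter); shown here with the same variable names for completeness:

# Train with per-sample weights instead of class_weight
randomForestClf = RandomForestClassifier(random_state=42)
randomForestClf.fit(X_train, y_train, sample_weight=weights)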

Below is a reference to a similar question, but the solutions provided there didn't work for me: sklearn RandomForestClassifier's class_weights seems to have no effect

Is there anything I'm missing? Thanks!

RB10
  • Can you provide a dataset that demonstrates the same issue? – Ben Reiniger Jul 08 '22 at 16:22
  • I've stored the dataset as a pickle file on Kaggle. Please follow the link below to download it: https://www.kaggle.com/datasets/reesha10/random-forest-cleaned-data-la – RB10 Jul 09 '22 at 07:39
  • That link just leads me to a page with "Not found" – Ben Reiniger Jul 12 '22 at 16:13
  • I'm sorry, I had left it on private. I changed it to public now, you should now be able to download the pickle file. – RB10 Jul 13 '22 at 08:46

1 Answer


The reason is that you grow the trees out fully, which leads to every leaf node being pure. That will happen regardless of the class weights (though the structure of the tree leading up to those pure nodes will change). The predicted probabilities of each tree will be (almost) all 0 or 1, and so the overall probability estimates are just driven by disagreements between the trees.
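
A quick way to see this (a diagnostic sketch, assuming a forest fitted with the default max_depth=None and the question's X_train):

import numpy as np

# Probability estimates for class 1 from each individual tree
per_tree = np.stack([tree.predict_proba(X_train)[:, 1]
                     for tree in randomForestClf.estimators_])
# With fully grown trees (almost) every leaf is pure, so per-tree outputs are
# (almost) all exactly 0 or 1; the forest's probability is just the vote fraction:
print(np.mean(np.isin(per_tree, [0.0, 1.0])))  # close to 1.0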

If you set e.g. max_depth=10 (or whatever tree complexity parameter you like), now many/most of the leaf nodes will not be pure. Setting larger positive-class weights will produce leaf values that are biased toward the positive class (but still aren't just 0 and 1), and so the probability estimates will be skewed higher across the board, leading to a higher recall (at the expense of precision, presumably).
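
For example (a minimal sketch with the question's weights; max_depth=10 is just an illustrative cap):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=42, max_depth=10,
                             class_weight={0: 0.5, 1: 2.75})
clf.fit(X_train, y_train)
# Impure leaves + larger class-1 weight -> probabilities skewed upward,
# so more samples cross the default 0.5 threshold and recall rises:
proba = clf.predict_proba(X_test)[:, 1]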

The ROC curve is relatively unaffected by class balance and the skewed-higher probabilities arising from the larger weights, and so shouldn't be heavily affected by changing weights, for a fixed max_depth.
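
(ROC-AUC depends only on the ranking of the scores, not their absolute values; a toy illustration with made-up labels and scores:)

import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 1, 0, 1])
p = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
# Any strictly increasing rescaling (here a square root, pushing scores higher)
# leaves the ranking, and hence the AUC, unchanged:
print(roc_auc_score(y, p), roc_auc_score(y, np.sqrt(p)))  # identical values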

Ben Reiniger
  • Thanks! The best max depth value obtained was 24 based on grid search. So is it preferable to grow out the trees fully? Or set max depth and then use class weight? – RB10 Jul 14 '22 at 06:16
  • @RB10 which to choose depends on your needs. I've personally always found pruned trees to perform better in random forests, and grid search results are probably what I would go with. – Ben Reiniger Jul 14 '22 at 12:43