
I'm trying to use sklearn's RandomForestClassifier to classify a dataset into two categories. The training data is highly unbalanced, with about 100,000 samples in the 'False' class and 10,000 in the 'True' class. Fitting the model on this data produced a test-set accuracy of around 97% on the 'False' class and only 78% on the 'True' class. I tried downsampling the 'False' class to the same size as the 'True' class, which led to a test accuracy of around 88% on both classes. I feel bad about throwing away around 90,000 observations, though, and I wonder whether a higher accuracy could be obtained if the imbalance were reduced some other way. This led me to set the class_weight parameter of RandomForestClassifier to 'balanced' and fit on the original dataset.

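from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, class_weight='balanced')
model.fit(X_train, y_train)
y_pred_test = model.predict(X_test)
# Rows of the confusion matrix are true labels; columns are predicted labels.
confusion = metrics.confusion_matrix(y_test, y_pred_test)
# Diagonal over column sums: the share of each class's predictions that were correct.
print("False Accuracy: ", confusion[0, 0] / confusion[:, 0].sum(), "True Accuracy: ", confusion[1, 1] / confusion[:, 1].sum())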
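For reference, sklearn's confusion_matrix puts true labels on the rows and predicted labels on the columns, so the diagonal divided by a column sum is per-class precision, while the diagonal divided by a row sum is per-class recall. A minimal sketch of both follows. It uses plain Python lists instead of a NumPy array, and the counts and helper names are made up for illustration only.

```python
# Hypothetical confusion-matrix counts, invented purely for illustration.
# Layout follows sklearn's convention: rows = true class, columns = predicted class.
confusion = [
    [90, 10],  # true False: 90 predicted False, 10 predicted True
    [20, 80],  # true True:  20 predicted False, 80 predicted True
]


def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly (diagonal / row sum)."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


def per_class_precision(cm):
    """Fraction of each predicted class that was correct (diagonal / column sum)."""
    return [cm[j][j] / sum(row[j] for row in cm) for j in range(len(cm))]


recall = per_class_recall(confusion)
precision = per_class_precision(confusion)
print("Recall    (False, True):", recall)
print("Precision (False, True):", precision)
```

The two measures can differ a lot on unbalanced data, so it may be worth checking both when comparing runs.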

Oddly, this change had absolutely no effect. I tried manually setting it to something like class_weight={True:1000000, False:1}, and this likewise had no effect. Neither did reversing those weights. The only way I could get an effect was to set one of the weights to zero, which broke everything.

My understanding is that tweaking class_weight will adjust the function that chooses optimal splits, causing it to favor accurately classifying the classes with higher weights. Based on that understanding, I would think that setting one of the weights extremely high would have the effect of making the model always predict that class, but this isn't happening for me. Does anyone know what I could be doing wrong?
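One way to picture that understanding: class weights can be thought of as scaling how much each sample contributes to the impurity measure that scores candidate splits. The sketch below is a simplified, hand-written weighted Gini impurity, not sklearn's actual implementation. The function name weighted_gini and the node counts are hypothetical.

```python
def weighted_gini(counts, class_weight):
    """Gini impurity of a node where each sample counts class_weight[label] times."""
    weighted = {label: n * class_weight[label] for label, n in counts.items()}
    total = sum(weighted.values())
    return 1.0 - sum((w / total) ** 2 for w in weighted.values())


# Hypothetical node holding 90 'False' samples and 10 'True' samples.
node = {False: 90, True: 10}

# Equal weights: the node looks fairly pure because 'False' dominates.
unweighted = weighted_gini(node, {False: 1, True: 1})

# Upweighting 'True' by 9 makes both classes carry the same total weight.
upweighted = weighted_gini(node, {False: 1, True: 9})

print(unweighted, upweighted)
```

Under this simplified formula the same node scores as much more impure once the minority class is upweighted, which is the kind of shift in split scoring described above.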

Posionus
  • "no effect" means the two printed scores? How do the predicted probabilities or AUC compare? – Ben Reiniger Jun 22 '20 at 15:25
  • AUC with class_weight=None is 0.9153, with class_weight='balanced' is 0.9130. Setting class_weight={True:1000000, False:1} yields AUC of 0.9159. – Posionus Jun 22 '20 at 16:40
  • Can you change True to 1 in your class_weight dictionary? I know the two compare equal, but they differ in type, so sklearn might simply not recognise your weights. – N. Kiefer Jun 24 '20 at 11:41
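On the key-type point raised in the last comment: in plain Python, True and 1 behave as the same dictionary key, because bool is a subclass of int and the two values compare equal and hash equally. The short check below shows only that built-in behaviour. Whether sklearn resolves class_weight keys through ordinary dict lookup, and how NumPy boolean labels behave there, is an open assumption that this sketch does not test.

```python
weights = {True: 1000000, False: 1}

# bool is a subclass of int; True == 1 and hash(True) == hash(1),
# so a plain dict treats them as the same key.
print(True == 1, hash(True) == hash(1))

# Looking up the integer keys finds the entries stored under the bool keys.
print(weights[1], weights[0])

# A dict written with integer keys compares equal to the bool-keyed one.
print(weights == {1: 1000000, 0: 1})
```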

0 Answers