I'm trying to use sklearn's RandomForestClassifier to classify a dataset into two categories. The training data is highly imbalanced: about 100,000 samples in the 'False' class and 10,000 in the 'True' class. Fitting the model on this data gave a test-set accuracy of around 97% on the 'False' class but only 78% on the 'True' class.

I tried downsampling the 'False' class to the same size as the 'True' class (roughly as sketched at the end of this question), which gave a test accuracy of around 88% on both classes. I feel bad about throwing away around 90,000 observations, though, and I wonder whether higher accuracy could be obtained by handling the imbalance without discarding data. That led me to try fitting on the original dataset with the class_weight parameter of RandomForestClassifier set to 'balanced':
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

model = RandomForestClassifier(n_estimators=100, class_weight='balanced')
model.fit(X_train, y_train)
y_pred_test = model.predict(X_test)
# Rows of the confusion matrix are true labels, columns are predicted labels.
confusion = metrics.confusion_matrix(y_test, y_pred_test)
print("False Accuracy: ", confusion[0, 0] / confusion[:, 0].sum(),
      "True Accuracy: ", confusion[1, 1] / confusion[:, 1].sum())
Oddly, switching to class_weight='balanced' had absolutely no effect on the results. I then tried manually setting it to something extreme like class_weight={True: 1000000, False: 1}, which likewise had no effect, as did reversing those weights. The only way I could produce any change was to set one of the weights to zero, which broke everything.
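Concretely, the manual-weight attempts were along these lines; only the constructor changed, and the fit and evaluation code stayed the same as above:

# Each of these produced essentially the same results as the 'balanced' run:
model = RandomForestClassifier(n_estimators=100, class_weight={True: 1000000, False: 1})
model = RandomForestClassifier(n_estimators=100, class_weight={False: 1000000, True: 1})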
My understanding is that adjusting class_weight changes the criterion used to choose splits, so the trees favor correctly classifying the higher-weighted classes. Based on that understanding, I would expect that setting one weight extremely high would make the model predict that class almost exclusively, but that isn't happening for me. Does anyone know what I could be doing wrong?
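In case a runnable example helps, this is a self-contained sketch of the kind of check I had in mind; the synthetic data from make_classification with a roughly 10:1 class ratio is purely an illustrative stand-in, not my real dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic stand-in for my data: two classes at roughly a 10:1 ratio.
X, y = make_classification(n_samples=22000, weights=[10 / 11], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# An extreme weight on the minority class; I expected this to push the
# model toward predicting class 1 almost everywhere.
model = RandomForestClassifier(n_estimators=100, class_weight={0: 1, 1: 1000000})
model.fit(X_train, y_train)
print(metrics.confusion_matrix(y_test, model.predict(X_test)))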
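Finally, for reference, the downsampling I mentioned at the top was roughly along these lines (a sketch assuming X_train and y_train are NumPy arrays with boolean labels; sklearn.utils.resample and the fixed random_state are just illustrative choices):

from sklearn.utils import resample
import numpy as np

# Keep every 'True' sample and an equally sized random subset of 'False' samples.
X_true, y_true = X_train[y_train], y_train[y_train]
X_false, y_false = resample(X_train[~y_train], y_train[~y_train],
                            replace=False, n_samples=len(y_true), random_state=0)
X_bal = np.concatenate([X_true, X_false])
y_bal = np.concatenate([y_true, y_false])
model = RandomForestClassifier(n_estimators=100).fit(X_bal, y_bal)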