I am currently using RandomForestClassifier in scikit-learn on imbalanced data, and I am not entirely clear about how the random forest implementation works. My questions are as follows:
- According to the documentation, there seems to be no way to set the sub-sample size (i.e. smaller than the original data size) for each tree learner. But in the random forest algorithm, each tree should be trained on both a subset of the samples and a subset of the features. Can this be achieved with scikit-learn, and if so, how?
Following is the description of RandomForestClassifier in the scikit-learn documentation:
"A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and use averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default)."
I found a similar question earlier, but it did not receive many answers:
How can SciKit-Learn Random Forest sub sample size may be equal to original training data size?
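For what it's worth, here is a minimal sketch of what I have in mind for question #1. Note this assumes a recent scikit-learn: the `max_samples` parameter was only added to RandomForestClassifier in version 0.22, while `max_features` (the per-split feature subset) has always been available. The dataset here is just a synthetic placeholder from `make_classification`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for my real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# max_samples (scikit-learn >= 0.22) caps the bootstrap sample drawn for
# each tree; max_features controls the feature subset tried at each split.
clf = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,
    max_samples=0.5,      # each tree sees a bootstrap sample of 500 rows
    max_features="sqrt",  # each split considers ~sqrt(20) = 4 features
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))
```

If this is the intended use of `max_samples`, it would answer question #1 for newer scikit-learn versions, but I am not sure whether older versions offer any equivalent.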
- For imbalanced data, if sub-sampling is possible in scikit-learn (i.e. if question #1 above is solved), can we build a balanced random forest? That is, for each tree learner, draw a subset from the less-populated class and the same number of samples from the more-populated class, so that each tree's training set has an equal distribution of the two classes; then repeat this process once per tree (i.e. as many times as the number of trees).
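To make question #2 concrete, here is a rough sketch of the balanced bagging procedure I describe, built by hand on top of scikit-learn's DecisionTreeClassifier (the data, the number of trees, and the vote threshold are all arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

rng = np.random.RandomState(0)
minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]
n = len(minority)

trees = []
for _ in range(50):
    # Draw equal-sized bootstrap samples from each class, so every tree
    # trains on a balanced subset.
    idx = np.concatenate([
        rng.choice(minority, size=n, replace=True),
        rng.choice(majority, size=n, replace=True),
    ])
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority vote across the trees.
votes = np.mean([t.predict(X) for t in trees], axis=0)
pred = (votes >= 0.5).astype(int)
```

Is there a built-in way to get this behavior instead of rolling it by hand? I am aware that RandomForestClassifier accepts `class_weight='balanced_subsample'` (which reweights rather than resamples), and that the separate imbalanced-learn package has a BalancedRandomForestClassifier, but I would like to confirm whether plain scikit-learn can do the resampling itself.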
Thank you! Cheng