6

In the documentation of the scikit-learn RandomForestClassifier, it is stated that

The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if bootstrap=True (default).

What I don't understand is this: if the sample size is always the same as the input sample size, then how can we talk about a random selection? There is no selection here, because we use all of the (and naturally the same) samples at each training.

Am I missing something here?

Lol4t0
TAK

4 Answers

6

I believe this part of docs answers your question

In random forests (see RandomForestClassifier and RandomForestRegressor classes), each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all features. Instead, the split that is picked is the best split among a random subset of the features. As a result of this randomness, the bias of the forest usually slightly increases (with respect to the bias of a single non-random tree) but, due to averaging, its variance also decreases, usually more than compensating for the increase in bias, hence yielding an overall better model.

The key to understanding is in "sample drawn with replacement". This means that each instance can be drawn more than once. This in turn means that some instances in the training set are present several times while others are not present at all (out-of-bag). These differ from tree to tree.
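A quick way to see this in practice (a minimal NumPy sketch, not part of scikit-learn itself — the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
indices = np.arange(N)  # stand-in for the rows of the training set

# A bootstrap sample: N draws *with replacement* from N instances
boot = rng.choice(indices, size=N, replace=True)

# Some indices appear multiple times; the rest are out-of-bag for this tree
in_bag = np.unique(boot)
out_of_bag = np.setdiff1d(indices, in_bag)

print("bootstrap sample:", boot)       # same length as the original, with repeats
print("out-of-bag:", out_of_bag)       # instances this tree never sees
```

Each tree gets its own such draw, so its in-bag/out-of-bag split differs from the other trees'.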

lanenok
Lol4t0
  • This part is OK: it states that the features are randomly chosen when constructing each split in a _single_ tree. However, what I wonder is whether there is a difference between the sets of observations (in other words, the matrix "X") used to train each different tree (I don't mean splits within a single tree here). – TAK Mar 07 '16 at 09:45
  • 1
    It is still not clear - 'sample size is the same as the input sample size': does this mean that the sample size for each decision tree equals the total number of training instances? If so, it means you are selecting the entire training set every time. – yegodz Jul 16 '18 at 21:06
2

Certainly not all samples are selected for each tree. By default, each sample has a 1 - ((N-1)/N)^N ≈ 0.632 chance of being drawn at least once for one particular tree, where N is the sample size of the training set. Because the draws are with replacement, many samples also appear two, three, or more times in the same bootstrap sample, while the remaining ~37% do not appear at all.
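The ≈ 0.632 figure can be checked numerically; it converges to 1 - 1/e as N grows (a short sketch):

```python
# Probability that a given instance appears at least once in a bootstrap
# sample of size N drawn with replacement from N instances:
#   P = 1 - ((N - 1) / N) ** N  ->  1 - 1/e ~ 0.632 as N grows
for N in (10, 100, 1000, 100000):
    p = 1 - ((N - 1) / N) ** N
    print(N, round(p, 4))
```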

Each bootstrap sample is, on average, different enough from the other bootstrap samples that the decision trees are adequately different, so the averaged prediction of the trees is robust to the variance of each individual tree model. If the sample size were increased to 5 times the training set size, nearly every observation would be present (probably 3-7 times) in each tree, the trees would be much more alike, and the overall ensemble prediction performance would suffer.

Soren Havelund Welling
1

The answer from @communitywiki misses the point of the question: "if the sample size is always the same as the input sample size, then how can we talk about a random selection". It has to do with the nature of bootstrapping itself. Bootstrapping repeats some of the same values multiple times while keeping the same sample size as the original data. Example (courtesy of the Wikipedia page on Bootstrapping/Approach):

  • Original Sample : [1,2,3,4,5]

  • Bootstrap 1 : [1,2,4,4,1]

  • Bootstrap 2: [1,1,3,3,5]

    and so on.

This is how random selection can occur while the sample size remains the same.
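The same idea in a couple of lines of Python (illustrative only; `random.choices` draws with replacement, so repeats are possible while the length is preserved):

```python
import random

random.seed(42)  # for reproducibility of this sketch
original = [1, 2, 3, 4, 5]

for i in range(2):
    boot = random.choices(original, k=len(original))  # sample WITH replacement
    print(f"Bootstrap {i + 1}: {boot}")  # same length as original, repeats possible
```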

Bharat Ram Ammu
0

Although I am pretty new to Python, I had a similar problem.

I tried to fit a RandomForestClassifier to my data. I split the data into train and test sets:

from sklearn.model_selection import train_test_split

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.2, random_state=0)

The lengths of the DataFrames were the same, but after I ran the model's predictions:

rfc_pred = rf_mod.predict(test_x)

The results had a different length.

To solve this, I set the bootstrap option to False in my grid-search parameters:

param_grid = {
    'bootstrap': [False],
    'max_depth': [110, 150, 200],
    'max_features': [3, 5],
    'min_samples_leaf': [1, 3],
    'min_samples_split': [6, 8],
    'n_estimators': [100, 200]
}

And ran the process all over again. It worked fine and I could compute my confusion matrix. But I would like to understand how to use bootstrap and generate predictions with the same length.
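For what it's worth, `bootstrap=True` should not change the length of `predict`'s output: `predict` always returns one prediction per input row, regardless of how the trees were sampled during fitting. A minimal sketch on synthetic data (names and parameters are illustrative, not taken from the post above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the poster's X, Y
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Bootstrap left at its default of True
rf_mod = RandomForestClassifier(n_estimators=50, bootstrap=True, random_state=0)
rf_mod.fit(train_x, train_y)
rfc_pred = rf_mod.predict(test_x)

# One prediction per test row, bootstrap or not
print(len(test_x), len(rfc_pred))
```

If a length mismatch appears, the cause is almost certainly elsewhere (e.g. comparing predictions against the wrong array), not the bootstrap setting.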