Roughly speaking, bootstrap sampling is just sampling with replacement, which naturally leads to some samples of the original dataset being left out, while others are present more than once.
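As a quick illustration (a minimal sketch with NumPy; the dataset size of 10 is just an arbitrary choice for the demo):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    indices = np.arange(n)

    # Bootstrap sample: draw n indices *with* replacement
    boot = rng.choice(indices, size=n, replace=True)

    # Some indices appear more than once, others not at all ("out-of-bag")
    in_bag, counts = np.unique(boot, return_counts=True)
    oob = np.setdiff1d(indices, in_bag)

    print("bootstrap sample:   ", boot)
    print("duplicated indices: ", in_bag[counts > 1])
    print("out-of-bag indices: ", oob)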
I thought that random forest was already a technique using bootstrap
You are right that the original RF algorithm as proposed by Breiman indeed incorporates bootstrap sampling by default (this is actually inherited from bagging, which RF builds upon).
Nevertheless, implementations like the scikit-learn one understandably prefer to leave available the option of not using bootstrap sampling (i.e. sampling with replacement) and using the whole dataset instead; from the docs:
The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
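In practice, the two behaviors can be selected like so (a sketch only; the toy dataset from make_classification and the 0.8 sub-sample fraction are my own choices for illustration):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, random_state=0)

    # Default behavior: each tree is built on a bootstrap sample
    # (here additionally limited to 80% of the rows via max_samples)
    rf_bootstrap = RandomForestClassifier(
        n_estimators=100, bootstrap=True, max_samples=0.8, random_state=0
    ).fit(X, y)

    # Bootstrap disabled: every tree is built on the whole dataset
    rf_full = RandomForestClassifier(
        n_estimators=100, bootstrap=False, random_state=0
    ).fit(X, y)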
The situation is similar in the standard R implementation, where the respective parameter is called replace and, as in scikit-learn, it is also set to TRUE by default.
So, nothing really strange here, beyond the (generally desirable) design choice of leaving the practitioner room and flexibility to use bootstrap sampling or not. In the early days of RF, bootstrap sampling offered the extra possibility of calculating the out-of-bag (OOB) error without resorting to cross-validation, an idea that (I think...) eventually fell out of favor and "freed" practitioners to try dropping the bootstrap sampling option, if doing so leads to better performance.
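For completeness, here is a sketch of the OOB idea in scikit-learn (oob_score requires bootstrap=True, since the estimate relies on the samples each tree did not see; again, the toy data is only illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, random_state=0)

    # OOB error is only available with bootstrap sampling enabled
    rf = RandomForestClassifier(
        n_estimators=200, bootstrap=True, oob_score=True, random_state=0
    ).fit(X, y)

    print("OOB accuracy estimate:", rf.oob_score_)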
You may also find parts of my answer in Why is Random Forest with a single tree much better than a Decision Tree classifier? useful.