
The documentation for `RandomForestClassifier` in scikit-learn says:

> A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is always the same as the original input sample size but the samples are drawn with replacement if `bootstrap=True` (default).

If the training set X has n instances, then the sub-sample drawn for each decision tree will also be of size n. Now if `bootstrap=True`, each sample is drawn with replacement, and it seems there is some statistical benefit to drawing a number of such samples (a quick sanity check is sketched below).
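To make that benefit concrete, here is a minimal sketch (plain NumPy, my own illustration rather than anything from the scikit-learn docs): a bootstrap sample of size n contains, on average, only about 1 - 1/e ≈ 63.2% of the distinct original instances, so each tree effectively trains on a different subset of the data.

```python
import numpy as np

# Simulate one bootstrap sample: draw n indices with replacement
# and count how many distinct original instances appear.
rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)
unique_fraction = np.unique(sample).size / n

# Prints roughly 0.632, i.e. about 1 - 1/e of the instances are
# present; duplicates pad the sample back up to size n.
print(f"unique instances in one bootstrap sample: {unique_fraction:.3f}")
```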

However, if `bootstrap=False` the sample is drawn without replacement, which means every sample is identical to the training set. Is that a correct interpretation? If so, every tree gets the exact same sample, so why would this be considered an ensemble?

- Note there's also a `max_features` parameter: each tree also gets a different set of features to work with (at each split, even). – Blorgbeard Jul 17 '18 at 00:14
- It's a correct interpretation, but as @Blorgbeard says, `max_features` is indeed the second key ingredient of RF (the other being the bootstrap sampling); this answer may be useful in clarifying things: [Why is Random Forest with a single tree much better than a Decision Tree classifier?](https://stackoverflow.com/questions/48239242/why-is-random-forest-with-a-single-tree-much-better-than-a-decision-tree-classif/48239653#48239653) (disclaimer: mine) – desertnaut Jul 17 '18 at 14:59
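To illustrate the point made in these comments, here is a minimal sketch (the toy dataset and every parameter value are my own choices, not anything from the thread): even with `bootstrap=False`, so that every tree trains on the identical full sample, `max_features` still randomizes which features are considered at each split, and the resulting trees disagree on unseen points.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=5,
    bootstrap=False,      # every tree trains on the same full sample
    max_features="sqrt",  # but only ~sqrt(20) features are tried per split
    random_state=0,
).fit(X_train, y_train)

# If the trees were exact copies, their predictions would agree everywhere;
# on held-out points they do not, so the forest is a genuine ensemble.
per_tree = np.stack([tree.predict(X_test) for tree in forest.estimators_])
disagreements = (per_tree != per_tree[0]).any(axis=0).sum()
print(f"trees disagree on {disagreements} of {len(X_test)} test points")
```

Conversely, if `max_features` were also set to `None` (consider all features at every split), the trees would be deterministic copies of one another, and the "ensemble" really would collapse to a single tree.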

0 Answers