
While constructing each tree in the random forest using bootstrapped samples, for each terminal node, we select m variables at random from p variables to find the best split (p is the total number of features in your data). My questions (for RandomForestRegressor) are:

1) What does max_features correspond to (m or p or something else)?

2) Are m variables selected at random from max_features variables (what is the value of m)?

3) If max_features corresponds to m, then why would I want to set it equal to p for regression (the default)? Where is the randomness with this setting (i.e., how is it different from bagging)?

Thanks.

Ravindra S
csankar69

3 Answers


Straight from the documentation:

[max_features] is the size of the random subsets of features to consider when splitting a node.

So max_features is what you call m. When max_features="auto", m = p and no feature subset selection is performed in the trees, so the "random forest" is actually a bagged ensemble of ordinary regression trees. The docs go on to say that

Empirical good default values are max_features=n_features for regression problems, and max_features=sqrt(n_features) for classification tasks

By setting max_features differently, you'll get a "true" random forest.
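As a minimal sketch (the data and parameter values here are illustrative, not a recommendation): `max_features=None` makes every feature a split candidate, which reduces the forest to bagged regression trees, while a fractional value such as `1/3` gives per-split feature subsampling:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=12, random_state=0)

# All 12 features are split candidates at every node: effectively bagged trees.
bagged = RandomForestRegressor(max_features=None, random_state=0)
bagged.fit(X, y)

# Roughly a third of the features are candidates at each split: a "true" random forest.
rf = RandomForestRegressor(max_features=1 / 3, random_state=0)
rf.fit(X, y)
```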

Fred Foo
  • So why do they claim that "Empirical good default values are max_features=n_features for regression problems"? As you say, this is just bagging; isn't a random forest supposed to be better than bagging? – csankar69 May 30 '14 at 18:32
  • @csankar69: I'm no expert in regression trees. I did work on the RFs because I use them for classification, and I can assure you their author is knowledgeable in these matters. In any case, you can check for yourself whether attribute bagging helps for your problem. – Fred Foo May 30 '14 at 19:36
  • I'm 95% sure that max_features=n_features for regression is a mistake on scikit's part. The original paper for RF gave max_features = n_features/3 for regression. It wouldn't even make sense to use the former; it's not even an RF. – Ulysse Mizrahi Aug 26 '16 at 09:19
  • Thanks @UlysseMizrah for the comment! [The Wikipedia article](https://en.wikipedia.org/wiki/Random_forest#From_bagging_to_random_forests) references *The Elements of Statistical Learning: 2nd Edition* (Hastie et al. 2009, p. 592), which reports that the original authors recommend n_features / 3. I thought I'd post these additional references in case someone is interested. – Matthew Gunn May 20 '17 at 20:04
  • @MatthewGunn so absent any justification from the sklearn author, for `RandomForestRegressor`, we "should" have that `max_features="auto"` ⇒ `max_features=int(n_features / 3.0)` ? – Matt Hancock Aug 04 '18 at 13:40
  • The 'per split' language has always confused me. max_features should be the number of features that are considered when constructing each decision tree, not just on a per-split level, right? I.e. each decision tree is made with the same features, instead of sampling different features per split. – lynnyi Jan 25 '19 at 23:49
  • The question whether `max_features=n_features` makes a good default is discussed in depth at https://stats.stackexchange.com/q/324370/295421 and https://github.com/scikit-learn/scikit-learn/issues/7254 – claasz Nov 03 '20 at 16:27
  • So does each tree in a random forest select a subset of features and then use max_features at each node (in which case max_features=n_features should still be considered an RF), or does RF keep all features and use a subset of them at each node? – haneulkim Jan 16 '21 at 15:59
  • Ah, I found the answer at https://sebastianraschka.com/faq/docs/random-forest-feature-subsets.html, for those interested. – haneulkim Jan 16 '21 at 16:12

@lynnyi, max_features is the number of features considered at each individual split, not for the construction of the whole decision tree. To be clear: during the construction of each decision tree, RF still has all n_features available, but at every node split it considers only max_features of them, drawn at random from the full feature set. You can confirm this by plotting one decision tree from an RF trained with max_features=1 and counting how many distinct features appear across that tree's nodes.
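Here is a small sketch of that check (the data and parameter values are illustrative): fit a forest with max_features=1, then count the distinct features a single tree splits on.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=8, random_state=0)
rf = RandomForestRegressor(n_estimators=10, max_features=1, random_state=0)
rf.fit(X, y)

# tree_.feature stores each node's split feature index; leaves are marked -2.
tree = rf.estimators_[0].tree_
used = np.unique(tree.feature[tree.feature >= 0])
print(f"tree 0 splits on {len(used)} of {X.shape[1]} features: {used}")
```

If the feature subset were fixed per tree, a max_features=1 tree could only ever split on a single feature; in practice such a tree typically splits on most of the 8 features, which confirms the draw happens anew at every node.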

Zhendong Cao
  • This is more of a comment on another comment than an answer to the question. – slfan May 29 '19 at 16:38
  • Sorry, but right now I have less than 50 reputation, so I cannot comment. – Zhendong Cao May 29 '19 at 17:50
  • Wait, so does each tree in a random forest actually use all features but randomly select a subset of them at each node? Or does each tree take a subset of features and from there take max_features at each node? – haneulkim Jan 16 '21 at 15:56
  • @Ambleu, the former is correct: each tree in a random forest uses all features but randomly selects a subset of them at each node. – Zhendong Cao Jan 16 '21 at 18:23

max_features is the number of features selected at random, without replacement, at each split. For example, if you have 10 independent columns or features, then max_features=5 will select 5 of them at random, without replacement, at every split.
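As an illustration of that sampling scheme (this is not scikit-learn's internal code, just a sketch of the equivalent draw):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, max_features = 10, 5

# One per-split draw: 5 of the 10 feature indices, chosen without replacement.
# A fresh draw like this happens at every candidate split.
candidates = rng.choice(n_features, size=max_features, replace=False)
```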

Mainak Sen