
I have a question about the max_features argument of RandomForestRegressor in sklearn. Is my understanding correct that when one uses max_features='auto', all features are always considered at each split? That is, does this yield the same model as using BaggingRegressor from sklearn with a DecisionTreeRegressor as the base estimator?
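
To make the comparison concrete, here is a minimal sketch of the two setups I mean (assuming a recent scikit-learn; max_features=None is used below because it explicitly means "all features", which is what 'auto' meant for regressors):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

# Forest that always considers all features at each split.
rf = RandomForestRegressor(
    n_estimators=100, max_features=None, bootstrap=True, random_state=0
)

# Bagged decision trees; the parameter was named base_estimator
# before scikit-learn 1.2.
bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=100, bootstrap=True, random_state=0,
)

rf.fit(X, y)
bag.fit(X, y)

# Conceptually the same ensemble; predictions are typically very close but
# not bit-identical, because the two classes consume randomness differently.
print(np.max(np.abs(rf.predict(X) - bag.predict(X))))
```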

What speaks against this interpretation is the following sentence in the documentation: "The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data, max_features=n_features and bootstrap=False, ..." How should this sentence be understood, then?

Thanks for the help!

  • This might be helpful: https://stackoverflow.com/questions/23939750/understanding-max-features-parameter-in-randomforestregressor – j__carlson Aug 23 '21 at 18:45
  • Digging into the [tuning guide documentation](https://scikit-learn.org/stable/modules/ensemble.html#random-forest-parameters) seems to support this conclusion as well: "Empirical good default values are max_features=None (_always_ considering _all features_ instead of a random subset) for regression problems..." (emphasis mine) – G. Anderson Aug 23 '21 at 18:45
  • Thanks, I saw stackoverflow.com/questions/23939750/ already, but what about that sentence in the documentation: how can the splits vary when max_features=n_features and bootstrap=False? –  Aug 23 '21 at 18:52
  • @alphaH, when two splits tie in performance, which one is selected is probably based on the order of the columns. – Ben Reiniger Aug 23 '21 at 20:26
  • @Ben Reiniger: Thanks for that. So just to repeat: you mean that with what is effectively a single decision tree (which is basically what you get with max_features=n_features and bootstrap=False), when feature X and feature Y would be equally good at a specific split point, sometimes feature X is chosen and sometimes feature Y, so there is still some sort of "randomness". Is this understanding correct? –  Aug 24 '21 at 17:40
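
A small experiment that illustrates the tie-breaking behaviour described in the comments above (a sketch only; the exact tie-breaking is an implementation detail and may differ across scikit-learn versions). Duplicating a column guarantees that two features always yield identical splits, so which one is chosen at the root depends on the random feature permutation:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x])                      # features 0 and 1 are identical
y = x.ravel() + rng.normal(scale=0.1, size=200)

roots = set()
for _ in range(20):                        # random_state deliberately unset
    tree = RandomForestRegressor(
        n_estimators=1, max_features=None, bootstrap=False
    ).fit(X, y).estimators_[0]
    roots.add(int(tree.tree_.feature[0]))  # feature index at the root node

# Typically prints {0, 1}: the tied features are picked interchangeably, so
# the fitted tree varies even with max_features=n_features and bootstrap=False.
print(roots)
```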

0 Answers