
I found this thread from 2014 and the answer states that no, the sklearn random forest classifier cannot handle categorical variables (or at least not directly). Has the answer changed as of 2020?

I want to feed gender as a feature for my model. However, gender can take on three values: M, F or np.nan. If I encode this column into three dichotomous columns, how can the random forest classifier know that these three columns represent a single feature?

Imagine max_features = 7. When training a given tree, it will randomly pick seven features. Suppose gender was chosen. If gender is split into three columns (gender_M, gender_F, gender_NA), will the random forest classifier always pick all three columns and count them as one feature, or is there a chance that it will only pick one or two? For concreteness, the encoding I have in mind looks roughly like the sketch below (column names are just what pandas produces; the NaN category gets its own column).
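
```python
import numpy as np
import pandas as pd

# toy frame with the three possible gender values
df = pd.DataFrame({"gender": ["M", "F", np.nan, "F", "M"],
                   "age": [23, 31, 45, 27, 52]})

# one-hot encode gender; dummy_na=True gives the missing values their own column
X = pd.get_dummies(df, columns=["gender"], dummy_na=True)
print(X.columns.tolist())
# e.g. ['age', 'gender_F', 'gender_M', 'gender_nan']
```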

yatu
Arturo Sbr

1 Answer


If max_features is set to a value lower than the actual number of columns (which is the advisable approach, see the recommended values for max_features in the docs), then yes, there is a chance that only a subset of the dummy columns is considered. Note that in sklearn max_features is applied at each split within each tree, not once per tree, so at any given split the three gender dummies need not all be among the candidate features.
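
For instance, a minimal sketch of the recommended setup (the parameter values here just follow the general guidance in the docs, nothing specific to your data):

```python
from sklearn.ensemble import RandomForestClassifier

# max_features="sqrt" (the usual recommendation for classification) makes each
# split consider only about sqrt(n_features) randomly drawn candidate columns,
# i.e. fewer than the total number of columns
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
```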

But that is not necessarily a bad thing. In a decision tree, the feature used at a given node is selected to optimize some split criterion independently of the other features, that is, considering only that feature and the target. So in a sense the model does not treat these dummy columns as belonging to the same feature anyway; a quick way to see which columns an individual tree ends up using is sketched below.
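
A small sketch on synthetic data (the seven columns and all values are made up, standing in for your dummy-encoded matrix); listing the columns the first tree actually splits on shows it simply uses whichever individual columns it finds useful:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 7 synthetic columns standing in for a dummy-encoded design matrix
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

clf = RandomForestClassifier(n_estimators=10, max_features=3, random_state=0)
clf.fit(X, y)

# tree_.feature stores the column index split on at each internal node;
# leaves are marked with negative values, so they are dropped here
first_tree = clf.estimators_[0].tree_
used_columns = sorted({int(f) for f in first_tree.feature if f >= 0})
print(used_columns)  # typically a strict subset of the 7 columns
```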

In general though, the best approach for a binary feature like this is to come up with an appropriate strategy to fill the missing values and then convert it into a single column encoded as 0s and 1s.
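
For example, something along these lines (the fill strategy and the resulting column name are purely illustrative choices):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", np.nan, "F", "M"]})

# fill missing values first (the mode is used here purely as an example strategy),
# then collapse the feature into a single 0/1 column
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])
df["gender_is_male"] = (df["gender"] == "M").astype(int)
```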

yatu
  • All correct, but arguably a reminder should be added that, as a rule, `max_features` is indeed set to a value (possibly much) lower than the total number of features. This was in fact one of the very innovative characteristics of RF; see [Why is Random Forest with a single tree much better than a Decision Tree classifier?](https://stackoverflow.com/questions/48239242/why-is-random-forest-with-a-single-tree-much-better-than-a-decision-tree-classif) – desertnaut Apr 30 '20 at 16:59
  • 1
    AFAIK randomness in the selection of features in the individual trees benefits the overall classification, since it brings down the bias. Though I can't see how that would be the case with a single estimator? I mean, I agree with your point, but looking into the post, I'm picturing an example where each feature is relevant, and taking a random sub-sample of these IMO should worsen the model. Maybe I'm missing something, just some thoughts on the linked post @desertnaut – yatu Apr 30 '20 at 17:07
  • 1
    As I mention explicitly in the linked answer, the fact that random feature selection alone improves performance is well-established. I agree that it's not very intuitive - perhaps it can be thought of (*very* roughly) as a "lasso-type" regularization. But the main point of my comment was not that, but the normal & recommended use of `max_features`, which I happily see you have incorporated in the answer ;) – desertnaut Apr 30 '20 at 17:17