I found this thread from 2014, and the answer states that no, sklearn's random forest classifier cannot handle categorical variables (or at least not directly). Has the answer changed in 2020?
I want to feed gender as a feature for my model. However, gender can take on three values: M, F, or np.nan. If I encode this column into three dichotomous columns, how can the random forest classifier know that these three columns represent a single feature?
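For concreteness, the encoding I have in mind looks roughly like the sketch below. The sample values are made up for illustration, and the column names simply follow the gender_M / gender_F / gender_NA naming used in this question.

```python
import numpy as np
import pandas as pd

# Made-up sample data: gender is M, F, or missing (np.nan)
df = pd.DataFrame({"gender": ["M", "F", np.nan, "F", "M"]})

# Build the three dichotomous (0/1) columns by hand so the names are explicit
encoded = pd.DataFrame({
    "gender_M": (df["gender"] == "M").astype(int),
    "gender_F": (df["gender"] == "F").astype(int),
    "gender_NA": df["gender"].isna().astype(int),
})

# Each row has exactly one of the three columns set to 1
row_sums = encoded.sum(axis=1).tolist()
print(encoded)
```

After this step the model only sees three separate numeric columns; the fact that they came from one variable lives only in the column names.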
Imagine max_features = 7. When training a given tree, it will randomly pick seven features. Suppose gender was chosen. If gender is split into three columns (gender_M, gender_F, gender_NA), will the random forest classifier always pick all three columns and count them as one feature, or is there a chance that it will only pick one or two?
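One way I could probe this empirically is sketched below. Everything in it is an assumption for illustration: the data is synthetic, the seven extra columns x0..x6 are hypothetical, and the labels are random. It fits a small forest and prints which of the gender columns each fitted tree actually split on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 200

# Synthetic one-hot gender block: exactly one of the three columns is 1 per row
gender_code = rng.integers(0, 3, size=n)
gender_onehot = np.eye(3, dtype=int)[gender_code]

# Seven hypothetical extra numeric features and random binary labels
other = rng.normal(size=(n, 7))
X = np.hstack([gender_onehot, other])
y = rng.integers(0, 2, size=n)
feature_names = ["gender_M", "gender_F", "gender_NA"] + [f"x{i}" for i in range(7)]

clf = RandomForestClassifier(n_estimators=5, max_features=7, random_state=0)
clf.fit(X, y)

# Number of input columns the fitted estimator reports
n_seen = clf.n_features_in_

# For each tree, list the gender columns that appear in at least one split
# (tree_.feature holds a column index per node, negative for leaves)
for i, est in enumerate(clf.estimators_):
    used = {feature_names[j] for j in est.tree_.feature if j >= 0}
    gender_used = sorted(name for name in used if name.startswith("gender_"))
    print(i, gender_used)
```

Here the estimator reports ten input columns, so the three gender columns are counted individually. Also, as far as I can tell from the scikit-learn documentation, max_features is the number of features considered when looking for each split, not a subset drawn once per tree.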