80

Say I have a categorical feature, color, which takes the values

['red', 'blue', 'green', 'orange'],

and I want to use it to predict something in a random forest. If I one-hot encode it (i.e. I change it to four dummy variables), how do I tell sklearn that the four dummy variables are really one variable? Specifically, when sklearn is randomly selecting features to use at different nodes, it should either include the red, blue, green and orange dummies together, or it shouldn't include any of them.
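
For reference, this is roughly what I mean by the four dummies (a quick pandas sketch; the `color_` prefix is just for illustration):

import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'orange'], name='color')
print(pd.get_dummies(colors, prefix='color').astype(int))
#    color_blue  color_green  color_orange  color_red
# 0           0            0             0          1
# 1           1            0             0          0
# 2           0            1             0          0
# 3           0            0             1          0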

I've heard that there's no way to do this, but I'd imagine there must be a way to deal with categorical variables without arbitrarily coding them as numbers or something like that.

Machavity
tkunk
  • This has been a useful and very long-standing enhancement request on sklearn, open since 2014. One consideration was whether they should prioritize supporting the new [pandas Categorical](http://pandas.pydata.org/pandas-docs/stable/categorical.html) type or plain numpy arrays. – smci Nov 16 '16 at 12:42
  • Possible duplicate of [How to handle categorical variables in sklearn GradientBoostingClassifier?](https://stackoverflow.com/questions/24706677/how-to-handle-categorical-variables-in-sklearn-gradientboostingclassifier) – Zhiyong Jun 29 '18 at 23:08

6 Answers

64

No, there isn't. Somebody's working on this and the patch might be merged into mainline some day, but right now there's no support for categorical variables in scikit-learn except dummy (one-hot) encoding.
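
For completeness, a minimal sketch of the one-hot workaround the answer mentions (the toy data and parameters here are just illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# toy data: one categorical column and a binary target
colors = np.array([['red'], ['blue'], ['green'], ['orange']] * 25)
y = np.array([0, 1] * 50)

# expand the single categorical column into four 0/1 dummy columns
X = OneHotEncoder().fit_transform(colors)   # sparse matrix of shape (100, 4)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)   # the forest treats each dummy as an independent feature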

Youwei Liang
Fred Foo
  • Ten thumbs up if this ever finally gets implemented. Personally I'd prioritize pandas Categorical over plain numpy, but the core maintainers want otherwise. – smci Nov 16 '16 at 12:45
  • [Example of one-hot encoding in sklearn for handling categorical features](https://stackoverflow.com/a/24874515/1045085). – Zhiyong Apr 14 '18 at 22:53
  • `one-hot encoding` does not handle categorical data the right way for a `random forest`; you will get better models than with `one-hot encoding` just by assigning arbitrary numbers to each category, but that's not the right way either. You can easily see this with R's randomForest package, which gives a totally different result, and it is not just random variance: repeat it as much as you want, the accuracy reached by `scikit-learn` using `one-hot encoding` is not even close to R's randomForest. – caiohamamura May 31 '22 at 16:07
  • Is it still the case that someone is working on this, or is this request finished now? I know the lightgbm package can handle categorical data if it is passed as a pandas column with .astype('category'). I would have thought that if lgbm has logic to handle categoricals, then random forest probably should as well. – Thomas Leyshon Oct 17 '22 at 21:03
30

Most implementations of random forest (and many other machine learning algorithms) that accept categorical inputs are either just automating the encoding of categorical features for you or using a method that becomes computationally intractable for large numbers of categories.

A notable exception is H2O. H2O has a very efficient method for handling categorical data directly, which often gives it an edge over tree-based methods that require one-hot encoding.

This article by Will McGinnis has a very good discussion of one-hot encoding and alternatives.

This article by Nick Dingwall and Chris Potts has a very good discussion about categorical variables and tree based learners.
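
For illustration, a rough sketch of how H2O consumes a categorical column directly (the frame and column names here are made up; adjust to your data):

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

frame = h2o.H2OFrame({
    'color': ['red', 'blue', 'green', 'orange'] * 25,
    'x': list(range(100)),
    'target': [0, 1] * 50,
})
frame['color'] = frame['color'].asfactor()    # mark as categorical (enum)
frame['target'] = frame['target'].asfactor()  # classification target

rf = H2ORandomForestEstimator(ntrees=100)
rf.train(x=['color', 'x'], y='target', training_frame=frame)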

desertnaut
denson
  • Another notable exception as of recently is LightGBM https://lightgbm.readthedocs.io/en/latest/Features.html#optimal-split-for-categorical-features with objective='rf' – zkurtz Apr 04 '19 at 04:54
  • You should add this as a separate answer! – denson Apr 06 '19 at 04:28
15

You have to make the categorical variable into a series of dummy variables. That is how sklearn works.

If you are using pandas, use `pd.get_dummies`; it works really well.
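
A minimal sketch of that suggestion (the column names are just for illustration):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'orange'],
                   'other_feature': [1, 2, 3, 4]})

# replace the 'color' column with four 0/1 dummy columns
df = pd.get_dummies(df, columns=['color'])
print(df.columns.tolist())
# ['other_feature', 'color_blue', 'color_green', 'color_orange', 'color_red']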

desertnaut
Hemanth Kondapalli
  • It only works well if the same unique values are present in training and at inference time, so it's not reliable when new categories show up. – marbel Dec 14 '16 at 02:25
  • It's not just annoying, it's suboptimal. Random forests perform worse when using dummy variables. See the following quote from this [article](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/): `Imagine our categorical variable has 100 levels, each appearing about as often as the others. The best the algorithm can expect to do by splitting on one of its one-hot encoded dummies is to reduce impurity by ≈ 1%, since each of the dummies will be 'hot' for around 1% of the samples.` – James Mchugh Dec 10 '19 at 13:55
2

No. There are two kinds of categorical features:

  1. Ordinal (the categories have an order): use OrdinalEncoder
  2. Nominal (no order, e.g. colors): use LabelEncoder or OneHotEncoder

Note: differences between LabelEncoder & OneHotEncoder:

  1. LabelEncoder: only for one column => usually we use it to encode the label column (i.e., the target column)
  2. OneHotEncoder: for multiple columns => can handle more features at one time (see the sketch below)
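
A short sketch of the encoders mentioned above, using the usual scikit-learn API (the toy data is just for illustration):

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

X = np.array([['red'], ['blue'], ['green'], ['orange']])

# Ordinal: one integer per category (implies an order between categories)
print(OrdinalEncoder().fit_transform(X).ravel())    # [3. 0. 1. 2.]

# One-hot: one binary indicator column per category
print(OneHotEncoder().fit_transform(X).toarray())   # 4x4 indicator matrix

# LabelEncoder: meant for a single 1-D target column
print(LabelEncoder().fit_transform(['cat', 'dog', 'cat']))   # [0 1 0]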
frr0717
0

Maybe you can use the numbers 1–4 to replace these four colors, i.e. put the number rather than the color name in that column. The column of numbers can then be used in the models.
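
For example, a tiny sketch of that mapping (the particular integer codes are arbitrary):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'orange']})
# map each color name to an arbitrary integer code
df['color'] = df['color'].map({'red': 1, 'blue': 2, 'green': 3, 'orange': 4})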

  • The answer is not correct. Replacing colors with the numbers 1–4 will mislead a tree-based model. If we could simply do that as you suggest, we would never have needed one-hot encoding. – Marvania Mehul Apr 08 '21 at 09:26
-3

You can feed categorical variables directly to a random forest using the approach below:

  1. First, convert the categories of the feature to numbers using sklearn's LabelEncoder
  2. Second, convert the label-encoded feature back to string (object) type

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# encode the categories as integers, then cast them back to strings (object dtype)
df[col] = le.fit_transform(df[col]).astype('str')

The code above will solve your problem.

pmadhu