110

There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says this:

Some advantages of decision trees are:

(...)

Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable. See the algorithms for more information.

But running the following script

import pandas as pd 
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])

outputs the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
    X = check_array(X, dtype=DTYPE, accept_sparse="csc")
  File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b

I know that in R it is possible to pass categorical data directly. Is it possible with Sklearn?

0xhfff
  • 1,215
  • 2
  • 9
  • 6
  • Related questions in the SE network: [Question 32622 in data-science](https://datascience.stackexchange.com/q/32622) and [Question 5226 in data-science](https://datascience.stackexchange.com/q/5226) – Carlos Pinzón May 11 '22 at 13:07

8 Answers

109

(This is just a reformat of my comment from 2016...it still holds true.)

The accepted answer for this question is misleading.

As it stands, sklearn decision trees do not handle categorical data - see issue #5442.

The recommended approach of using Label Encoding converts to integers which the DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good - you'll end up with splits that do not make sense.

Using a OneHotEncoder is the only current valid way, allowing arbitrary splits not dependent on the label ordering, but is computationally expensive.
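
For illustration, a minimal sketch of that approach on the question's toy data might look like this (one possible wiring using a ColumnTransformer so the numeric column is passed through untouched; this is my addition, not part of the original answer):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({'A': ['a', 'a', 'b', 'a'],
                     'B': ['b', 'b', 'a', 'b'],
                     'C': [0, 0, 1, 0],
                     'Class': ['n', 'n', 'y', 'n']})

# one-hot encode the string columns, leave the numeric column 'C' as-is
preprocess = ColumnTransformer([('onehot', OneHotEncoder(), ['A', 'B'])],
                               remainder='passthrough')

X = preprocess.fit_transform(data[['A', 'B', 'C']])
tree = DecisionTreeClassifier().fit(X, data['Class'])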

desertnaut
  • 57,590
  • 26
  • 140
  • 166
James Owers
  • 7,948
  • 10
  • 55
  • 71
  • 6
    OneHotEncoding can deteriorate the performance of decision trees apparently as it leads to extremely sparse features, which can mess up feature importances https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/ – Arun Jan 29 '20 at 21:49
  • 2
    Agreed - I'm not recommending this approach, but it's the only way to avoid the issue I describe at present. – James Owers Feb 24 '20 at 16:49
  • 2
    I suspect there are instances (with features with many small levels) where the "nonsense" splits on an ordinally-encoded categorical feature nevertheless produce better performance than the very limited splits on the one-hot-encoded feature. – Ben Reiniger Jul 20 '20 at 21:44
  • 1
    is there any other implementation of Decision tree classifier which can handle this? – Modem Rakesh goud Aug 06 '20 at 10:35
  • 2
    To update: this Pull Request (and the discussion within) may be of interest: https://github.com/scikit-learn/scikit-learn/pull/12866 – James Owers Apr 21 '21 at 13:42
  • So what to do if my column is ordinal and has high cardinality? and cannot bin ordinal values, in such case I just need to use different model? – haneulkim Aug 30 '21 at 09:28
  • @haneulkim If it is ordinal you have no issue - you can treat your categories as integers just fine. – James Owers Sep 13 '21 at 14:55
  • @JamesOwers Ah my mistake, I meant to say non-ordinal column with high cardinality – haneulkim Sep 14 '21 at 00:52
23

Able to handle both numerical and categorical data.

This only means that you can use

  • the DecisionTreeClassifier class for classification problems
  • the DecisionTreeRegressor class for regression.

In any case you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()

# get_dummies replaces each string column with 0/1 indicator columns
one_hot_data = pd.get_dummies(data[['A','B','C']], drop_first=True)
tree.fit(one_hot_data, data['Class'])
desertnaut
  • 57,590
  • 26
  • 140
  • 166
Guillaume
  • 3,471
  • 1
  • 9
  • 14
  • 3
    You may want to play around with 'pd.get_dummies', for example the option 'drop_first = True' could help to avoid multicollinearity problems. [Here](https://www.youtube.com/watch?v=0s_1IsROgDc) there is a nice tutorial. – Rafael Valero Aug 23 '18 at 13:17
9

For nominal categorical variables, I would not use LabelEncoder but sklearn.preprocessing.OneHotEncoder or pandas.get_dummies instead, because there is usually no order in this type of variable.
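
As a tiny illustration of the "no order" point (using made-up colour values, not data from the question):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colours = [['red'], ['green'], ['blue'], ['green']]

print(LabelEncoder().fit_transform([c[0] for c in colours]))
# [2 1 0 1] -- the integers suggest an ordering blue < green < red that doesn't exist
print(OneHotEncoder().fit_transform(colours).toarray())
# one 0/1 indicator column per colour, with no implied ordering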

5

As of v0.24.0, scikit-learn natively supports categorical features in HistGradientBoostingClassifier and HistGradientBoostingRegressor!

To enable categorical support, a boolean mask can be passed to the categorical_features parameter, indicating which feature is categorical. In the following, the first feature will be treated as categorical and the second feature as numerical:

>>> gbdt = HistGradientBoostingClassifier(categorical_features=[True, False])

Equivalently, one can pass a list of integers indicating the indices of the categorical features:

>>> gbdt = HistGradientBoostingClassifier(categorical_features=[0])

You still need to encode your strings, otherwise you will get a "could not convert string to float" error. See here for an example of using OrdinalEncoder to convert strings to integers.
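
For example, a rough end-to-end sketch on the question's toy data could look like this (my wiring, assuming scikit-learn >= 1.0, where the estimator no longer needs the experimental import):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import HistGradientBoostingClassifier

data = pd.DataFrame({'A': ['a', 'a', 'b', 'a'],
                     'B': ['b', 'b', 'a', 'b'],
                     'C': [0, 0, 1, 0],
                     'Class': ['n', 'n', 'y', 'n']})

X = data[['A', 'B', 'C']].copy()
X[['A', 'B']] = OrdinalEncoder().fit_transform(X[['A', 'B']])  # strings -> integer codes

# columns 0 and 1 ('A' and 'B') are categorical, column 2 ('C') is numerical
gbdt = HistGradientBoostingClassifier(categorical_features=[True, True, False])
gbdt.fit(X, data['Class'])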

Bora M. Alper
  • 3,538
  • 1
  • 24
  • 35
  • 1
    Sorry for the ignorant question, but does it have to do with Decision Trees? If so, can you please provide an example of how we can now use categorical variables with a Decision Tree (I am a noob...)? – umbe1987 Mar 23 '21 at 14:18
  • 1
    This is gradient boosting. OP is asking for decision tree. – remykarem Mar 28 '21 at 12:24
2

Yes, a decision tree is able to handle both numerical and categorical data, but that only holds in theory. In the implementation you should apply either an OrdinalEncoder or one-hot encoding to the categorical features before training or testing the model. Always remember that ML models don't understand anything other than numbers.
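
One way to wire that up (a sketch using two of the question's columns; keep in mind the ordering caveat about OrdinalEncoder raised in the top answer) is to put the encoder and the tree in a Pipeline, so the same encoding is applied at training and at prediction time:

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({'A': ['a', 'a', 'b', 'a'],
                     'B': ['b', 'b', 'a', 'b'],
                     'Class': ['n', 'n', 'y', 'n']})

# the encoder learned during fit() is reused automatically inside predict()
model = make_pipeline(OrdinalEncoder(), DecisionTreeClassifier())
model.fit(data[['A', 'B']], data['Class'])
print(model.predict(pd.DataFrame({'A': ['b'], 'B': ['a']})))   # ['y']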

jaabir
  • 59
  • 6
1

Sklearn Decision Trees do not handle conversion of categorical strings to numbers. I suggest you find a function in Sklearn (maybe this) that does so or manually write some code like:

def cat2int(column):
    # map each distinct string to an integer index (sorted, so the mapping is stable)
    vals = sorted(set(column))
    for i, string in enumerate(column):
        column[i] = vals.index(string)
    return column
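
For example, applied to column A from the question (keeping a small dictionary around so the integers can be mapped back to strings later, as suggested in the comments below):

column = ['a', 'a', 'b', 'a']            # column A from the question
encoded = cat2int(list(column))          # e.g. [0, 0, 1, 0]

int2cat = dict(zip(encoded, column))     # e.g. {0: 'a', 1: 'b'}
decoded = [int2cat[i] for i in encoded]  # ['a', 'a', 'b', 'a']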
mrwyatt
  • 183
  • 6
  • Yeah, that's what I usually do, but for printing it is not really good. – 0xhfff Jun 29 '16 at 20:00
  • 1
    If you want to go from integer back to string representation, make a dictionary that holds the mapping between string and integer and use that to "decode" the integer representation. – mrwyatt Jun 29 '16 at 20:01
  • 2
    The statement is inaccurate. Scikit-learn classifiers don't implicitly handle label encoding. However, Scikit-learn provides a lot of classes to handle this. I would recommend using scikit learn tools because they can also be fit in a Machine Learning Pipeline with minimal effort. – Abhinav Arora Jun 29 '16 at 20:46
-1

With sklearn classifiers, you can model categorical variables both as an input and as an output.

Let's assume you have categorical predictors and categorical labels (i.e. a multi-class classification task). Moreover, you want to handle missing or unknown values for both predictors and labels.

First, you need an encoder like OrdinalEncoder.

Basic example:

# encoders
from sklearn.preprocessing import OrdinalEncoder

# df is assumed to be a DataFrame with columns 'Attribute A', 'Attribute B' and 'Label'
input_enc = OrdinalEncoder(unknown_value=-1, handle_unknown='use_encoded_value', encoded_missing_value=-1)
output_enc = OrdinalEncoder(unknown_value=-1, handle_unknown='use_encoded_value', encoded_missing_value=-1)

input_enc.fit(df[['Attribute A','Attribute B']].values)
output_enc.fit(df[['Label']].values)


# build classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)

X = input_enc.transform(df[['Attribute A','Attribute B']].values)
Y = output_enc.transform(df[['Label']].values)

clf.fit(X, Y.ravel())   # ravel() flattens the (n, 1) label column into the 1-D shape fit() expects

# predict: encode the raw attribute values, then decode the predicted label
predicted = clf.predict(input_enc.transform([('Value 1', 'Value 2')]))
predicted_label = output_enc.inverse_transform([predicted])

If you use df[...].values, your encoder will not store attribute names (column names). This does not matter, as long as the same format is used for enc.transform() or enc.inverse_transform() (otherwise you will get a warning).

OrdinalEncoder by default does not handle nan values and they are not handled by clf.fit(). This is solved by the encoded_missing_value parameter.

In the prediction phase, by default the encoder will throw an error when asked to transform unknown labels. This is handled by the handle_unknown parameter.

Pawel
  • 900
  • 2
  • 10
  • 19
  • I see this answer was hated, even though it follows the official scikit-learn documentation: https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-categorical-features – Pawel May 15 '23 at 17:47
-7

I would prefer to use tools provided by Scikit-Learn for this purpose. The main reason for doing so is that they can be easily integrated in a Pipeline.

Scikit-Learn itself provides very good classes to handle categorical data. Instead of writing your custom function, you should use LabelEncoder which is specially designed for this purpose.

Refer to the following code from the documentation:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"]) 

This automatically encodes them into numbers for your machine learning algorithms. Now this also supports going back to strings from integers. You can do this by simply calling inverse_transform as follows:

list(le.inverse_transform([2, 2, 1]))

This would return ['tokyo', 'tokyo', 'paris'].

Also note that for many other classifiers, apart from decision trees, such as logistic regression or SVM, you would like to encode your categorical variables using One-Hot encoding. Scikit-learn supports this as well through the OneHotEncoder class.
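
For instance, a small sketch for a linear model (made-up city data, added purely for illustration):

from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression

cities = [['paris'], ['paris'], ['tokyo'], ['amsterdam'], ['tokyo'], ['amsterdam']]
y = [0, 0, 1, 1, 1, 0]                      # made-up target labels

X = OneHotEncoder().fit_transform(cities)   # one 0/1 column per city, no implied order
clf = LogisticRegression().fit(X, y)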

desertnaut
  • 57,590
  • 26
  • 140
  • 166
Abhinav Arora
  • 3,281
  • 18
  • 20
  • 178
    -1 this is misleading. As it stands, sklearn decision trees do not handle categorical data - [see issue #5442](https://github.com/scikit-learn/scikit-learn/issues/5442). This approach of using Label Encoding converts to integers which the `DecisionTreeClassifier()` **will treat as numeric**. If your categorical data is not ordinal, this is not good - you'll end up with splits that do not make sense. Using a `OneHotEncoder` is the only current valid way, but is computationally expensive. – James Owers Oct 06 '16 at 22:09
  • @Abhinav, Is it possible to apply the `LabelEncoder` on more than one column of a dataframe at once? For instance, in the dataframe from the question, can we do something like `le.fit_transform(data[['A','B','C']])` to get labels for all categorical columns at once? Or should be specify the categorical columns explicitly to convert just the categorical columns. – Minu May 04 '17 at 16:06
  • @kungfujam, Also, I would like to `One-Hot Encode` the categorical columns once I `LabelEncode` them - to address the issue that @kungfujam pointed out. How can I do that once I get the label encoding done? – Minu May 04 '17 at 16:08
  • 23
    This is highly misleading. Please don't convert strings to numbers and use in decision trees. There is no way to handle categorical data in scikit-learn. One option is to use the decision tree classifier in Spark - in which you can explicitly declare the categorical features and their ordinality. Refer here for more details https://github.com/scikit-learn/scikit-learn/pull/4899 – Pradeep Banavara Jan 23 '18 at 11:16
  • 7
    Everybody must learn Scales of Measurement viz Nominal, Ordinal, Interval and Ratio scales. Number doesn't mean it is numerical in Nominal scale; it is just a flag. For example we may use 1 for Red, 2 for Blue and 3 for Green. Let's say 10 persons preferred Red and 10 preferred Green. Does it make sense to calculate the mean ((10*1+10*3)/20 = 2) and state that on an average preference is for Blue?? – Regi Mathew Apr 05 '19 at 12:35
  • @JamesOwers it's definitely worth turning your comment to an answer. – ayorgo Jul 02 '19 at 14:57
  • 2
    Er...I had no idea it had that much attention. Cheers @ayorgo, will do! – James Owers Jul 02 '19 at 17:10
  • 1
    So this is why my intern candidates have no clue on how to treat categorical variables. – Daniel Severo Dec 09 '19 at 20:24
  • +1, I think this answer is actually valid and does not deserve this criticism. Tree-based algorithms actually not only can use numerically-encoded cat. variables, but even tend to have better accuracy (and computational performance) than using OHE. It does not matter if the category is ordinal and the splits "do not make sense", the trees will figure out splits that produce good behavior. See https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769 or https://spark.apache.org/docs/latest/mllib-decision-tree.html – Melkor.cz Jun 30 '23 at 08:42