How to handle categorical features for Decision Tree, Random Forest in spark ml?

Question

I am trying to build decision tree and random forest classifiers on the UCI bank marketing data set: https://archive.ics.uci.edu/ml/datasets/bank+marketing. There are many categorical features (with string values) in the data set.

In the Spark ML documentation, it's mentioned that categorical variables can be converted to numeric values by indexing with either StringIndexer or VectorIndexer. I chose StringIndexer (VectorIndexer requires a vector feature column, and VectorAssembler, which converts features into a vector column, accepts only numeric types). With this approach, each level of a categorical feature is assigned a numeric value based on its frequency (0 for the most frequent label of a categorical feature).

My question is how the Random Forest or Decision Tree algorithm will understand that the new features (derived from categorical features) are different from continuous variables. Will indexed feature be considered as continuous in the algorithm? Is it the right approach? Or should I go ahead with One-Hot-Encoding for categorical features?

I read some of the answers on this forum, but I didn't get clarity on the last part.

vdep · Accepted Answer · 2017-07-07T10:14:12.867

One-hot encoding should be done for categorical variables with more than two categories.

To understand why, you should know the difference between the two subcategories of categorical data: ordinal data and nominal data.

Ordinal Data: The values have some sort of ordering between them. Example: customer feedback (excellent, good, neutral, bad, very bad). As you can see, there is a clear ordering between them (excellent > good > neutral > bad > very bad). In this case, StringIndexer alone is sufficient for modelling purposes.

Nominal Data: The values have no defined ordering between them. Example: colours (black, blue, white, ...). In this case, StringIndexer alone is NOT sufficient, and one-hot encoding is required after string indexing.

After string indexing, let's assume the output is:

 id | colour   | categoryIndex
----|----------|---------------
 0  | black    | 0.0
 1  | white    | 1.0
 2  | yellow   | 2.0
 3  | red      | 3.0

Then, without one-hot encoding, the machine learning algorithm will assume red > yellow > white > black, which we know is not true. OneHotEncoder() will help us avoid this situation.
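As a rough illustration of what one-hot encoding does to the indexed column above, here is a plain-Python sketch. It is not Spark's OneHotEncoder, and the helper name `one_hot` is made up for this example. Each index becomes a vector with a single 1.0, so no colour is numerically "greater" than another. Spark's OneHotEncoder drops the last category by default; this sketch keeps all of them for readability.

```python
def one_hot(index, num_categories):
    """Return a list with 1.0 at position `index` and 0.0 elsewhere (hypothetical helper)."""
    vec = [0.0] * num_categories
    vec[int(index)] = 1.0
    return vec

# categoryIndex values taken from the table above
category_index = {"black": 0.0, "white": 1.0, "yellow": 2.0, "red": 3.0}

# One vector per colour; every vector has exactly one non-zero entry
encoded = {
    colour: one_hot(idx, len(category_index))
    for colour, idx in category_index.items()
}

for colour, vec in encoded.items():
    print(colour, vec)
```

With this representation, "red" and "black" differ only in which position holds the 1.0, not in magnitude.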

So to answer your question,

Will indexed feature be considered as continuous in the algorithm?

It will be considered a continuous variable.

Is it the right approach? Or should I go ahead with One-Hot-Encoding for categorical features

It depends on your understanding of the data. Although Random Forest and some boosting methods don't require one-hot encoding, most ML algorithms need it.

Refer: https://spark.apache.org/docs/latest/ml-features.html#onehotencoder

Thanks for your detailed response. I am more concerned about nominal data. In Spark ML, I can't feed string values as they are to random forest; I need to convert them into numeric values. When I convert them to numeric, the order of the values will not make sense, so it looks like I will have to do one-hot encoding for nominal categorical features for random forest too. So your response, "It depends on your understanding of the data. Although Random Forest and some boosting methods don't require one-hot encoding, most ML algorithms need it", doesn't apply to nominal data. Let me know if you disagree. — user6200992, Jul 07 '17 at 14:49
Yes, if you know that a particular column contains nominal data, then do one-hot encoding. If it is ordinal data, you can use StringIndexer alone (although it is not wrong to do one-hot encoding for ordinal data). — vdep, Jul 07 '17 at 17:05
To my understanding, tree-based algorithms (i.e. Random Forest, XGBoost, etc.) do not require one-hot encoding of categorical variables. However, distance-based algorithms such as `logistic regression` (or any type of regression method that uses the least squares method) and `support vector machines` need one-hot-encoded inputs, as explained above by @vdep. If I am wrong, please correct me. Thanks — Alain Michael Janith Schroter, Apr 21 '19 at 02:55

score 4 · Answer 2 · answered Nov 20 '17 at 19:38

In short, Spark's RandomForest does NOT require OneHotEncoder for categorical features created by StringIndexer or VectorIndexer.

Longer Explanation. In general DecisionTrees can handle both Ordinal and Nominal types of data. However, when it comes to the implementation, it could be that OneHotEncoder is required (as it is in Python's scikit-learn).
Luckily, Spark's implementation of RandomForest honors categorical features if properly handled and OneHotEncoder is NOT required! Proper handling means that categorical features contain the corresponding metadata so that RF knows what it is working on. Features that have been created by StringIndexer or VectorIndexer contain metadata in the DataFrame about being generated by the Indexer and being categorical.

It makes sense, but is it still valid if a next step takes features created with StringIndexer and with VectorAssembler (for numeric features) and then creates a vector column from those two with another VectorAssembler? My worry is that when creating the full vector column from categorical + numeric features, the metadata for the categorical ones gets lost. — Luis Leal, Mar 16 '22 at 22:55

score 1 · Answer 3 · answered Oct 26 '18 at 03:40

According to vdep's answer, StringIndexer is enough for ordinal data. However, StringIndexer sorts the values by label frequency; for example, "excellent > good > neutral > bad > very bad" may become "good, excellent, neutral". So StringIndexer is not suitable for ordinal data.

Secondly, for nominal data, the documentation tells us that
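To make the frequency issue concrete, here is a small plain-Python sketch (not Spark code; `frequency_index` and the sample data are made up for illustration). It mimics frequency-based indexing and compares it with an explicit rank mapping that keeps the intended ordinal order. The alphabetical tie-breaking rule is an assumption of this sketch.

```python
from collections import Counter

def frequency_index(values):
    """Give index 0.0 to the most frequent value, 1.0 to the next, and so on.

    Ties are broken alphabetically; that rule is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical feedback column: "good" happens to be the most common answer
feedback = ["good", "good", "good", "excellent", "excellent", "neutral"]

# Frequency-based indexing: good -> 0.0, excellent -> 1.0, neutral -> 2.0
by_frequency = frequency_index(feedback)

# Explicit mapping that preserves the ordinal scale from worst to best
ordinal_levels = ["very bad", "bad", "neutral", "good", "excellent"]
by_rank = {level: float(i) for i, level in enumerate(ordinal_levels)}
ranked = [by_rank[v] for v in feedback]

print(by_frequency)
print(ranked)
```

In the frequency-based result, "neutral" ends up with a larger index than "excellent", so the numeric order no longer reflects the feedback scale, whereas the explicit mapping keeps it.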

for a binary classification problem with one categorical feature with three categories A, B and C whose corresponding proportions of label 1 are 0.2, 0.6 and 0.4, the categorical features are ordered as A, C, B. The two split candidates are A | C, B and A , C | B where | denotes the split.

Is the "corresponding proportions of label 1" the same as the label frequency? I am confused about whether feeding StringIndexer output to a DecisionTree in Spark is feasible.
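As far as I can read the quoted passage, the "proportion of label 1" is the fraction of rows within each category whose target label is 1, which is a different quantity from how often the category itself appears (the thing StringIndexer sorts by). A plain-Python sketch of the ordering described in the quote follows; the helper name `ordered_split_candidates` is hypothetical and this is not Spark's internal code.

```python
def ordered_split_candidates(label_proportion):
    """Sort categories by their proportion of label 1, then list every
    left/right partition of that ordering (one fewer than the category count)."""
    ordered = sorted(label_proportion, key=label_proportion.get)
    splits = [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]
    return ordered, splits

# Proportions of label 1 per category, taken from the quoted example
proportions = {"A": 0.2, "B": 0.6, "C": 0.4}

ordered, splits = ordered_split_candidates(proportions)

print(ordered)
for left, right in splits:
    print(", ".join(left), "|", ", ".join(right))
```

For the three categories in the quote, this yields the ordering A, C, B and the two candidates A | C, B and A, C | B, matching the documentation's example.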
