Spark ML DecisionTreeClassifier to Identify Categorical Features

Question

I have one continuous feature, 'Tenure', and one categorical feature, 'Nationality', in my sample. The observations cover more than 50 different nationalities and 30 different tenures (0-30 years). In Spark ML, to mark features as categorical you set maxCategories on a VectorIndexer, as below, before creating a DecisionTreeClassifier model.

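import org.apache.spark.ml.feature.VectorIndexer

// Features with at most 5 distinct values are treated as categorical.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(5)
  .fit(vecDF)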

But in this case it does not work, because 'Tenure' is continuous yet has fewer distinct values than 'Nationality': any maxCategories large enough to cover 'Nationality' also marks 'Tenure' as categorical. Is there a way to specify which features are categorical, as in Spark MLlib? Thanks

val categoricalFeaturesInfo = Map[Int, Int]()

A workaround would be converting 'Tenure' into a categorical feature, but some information from the sample would be lost — davidzxc574, Feb 22 '19 at 06:38
You can also exclude the continuous feature from the vector, index the vector, and then add the continuous feature back. Also, from the documentation: `Decide which features should be categorical based on the number of distinct values, where features with at most maxCategories are declared categorical.` — addmeaning, Feb 22 '19 at 08:55
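
The idea in the comment above (index only the categorical feature, then combine it with the continuous one) could be sketched roughly as follows. This is a variant, not the commenter's exact code: it uses StringIndexer instead of VectorIndexer, relying on the fact that StringIndexer marks its output column as nominal and VectorAssembler carries that metadata into the feature vector, so the tree should treat 'Nationality' as categorical and 'Tenure' as continuous. The toy rows, the column names "NationalityIndex" and "label", and the maxBins value of 64 are assumptions for illustration only. Note that maxBins has to be at least the number of categories, and its default of 32 is below the 50+ nationalities in the question.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object CategoricalByMetadata {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("categorical-by-metadata-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy data standing in for the real sample.
    val df = Seq(
      ("FR", 3.0, 0.0),
      ("DE", 12.0, 1.0),
      ("US", 25.0, 1.0),
      ("FR", 7.0, 0.0)
    ).toDF("Nationality", "Tenure", "label")

    // StringIndexer maps each nationality to an index and tags the
    // output column as nominal (categorical) in its metadata.
    val nationalityIndexer = new StringIndexer()
      .setInputCol("Nationality")
      .setOutputCol("NationalityIndex")

    // VectorAssembler keeps the nominal metadata of NationalityIndex;
    // Tenure is a plain numeric column, so it stays continuous.
    val assembler = new VectorAssembler()
      .setInputCols(Array("NationalityIndex", "Tenure"))
      .setOutputCol("features")

    // maxBins must be >= the number of distinct nationalities
    // (64 is an assumed value chosen to cover 50+ categories).
    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxBins(64)

    val model = new Pipeline()
      .setStages(Array(nationalityIndexer, assembler, dt))
      .fit(df)

    model.transform(df).select("Nationality", "Tenure", "prediction").show()
    spark.stop()
  }
}
```

With this layout no maxCategories threshold is involved, so the number of distinct 'Tenure' values does not affect how the feature is treated.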


0 Answers