I have one continuous feature, 'Tenure', and one categorical feature, 'Nationality', in my sample. The observations cover more than 50 different nationalities and 30 different tenures (0-30 years). In Spark ML, to identify which features are categorical you set maxCategories on a VectorIndexer, as below, before creating a DecisionTreeClassifier model.

    import org.apache.spark.ml.feature.VectorIndexer

    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(5)
      .fit(vecDF)

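But in this case it does not work, because 'Tenure' is continuous yet has fewer distinct values than 'Nationality': any maxCategories value high enough to mark 'Nationality' as categorical also marks 'Tenure' as categorical. Is there a way to specify explicitly which features are categorical, as with categoricalFeaturesInfo in Spark MLlib? Thanks

    val categoricalFeaturesInfo = Map[Int, Int]()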
davidzxc574
  • A workaround would be to convert 'Tenure' into a categorical feature, but some information in the sample would be lost – davidzxc574 Feb 22 '19 at 06:38
  • You can also exclude the continuous feature from the vector, index the vector, and then add the continuous feature back. Also, from the documentation: `Decide which features should be categorical based on the number of distinct values, where features with at most maxCategories are declared categorical.` – addmeaning Feb 22 '19 at 08:55
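
The idea in the last comment can be sketched roughly as follows. This is a variant, not the commenter's exact recipe: it assumes 'Nationality' is a string column and uses StringIndexer instead of VectorIndexer. StringIndexer marks its output column as nominal in the column metadata, and VectorAssembler carries that metadata into the feature vector, while a plain Double column such as 'Tenure' is recorded as numeric. The toy data, the "label" and "NationalityIndex" column names, and the maxBins value of 64 are assumptions for illustration only. The snippet is meant to be pasted into spark-shell.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("categorical-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data; the real sample has 50+ nationalities.
val df = Seq(
  ("FR", 3.0, 1.0), ("DE", 12.0, 0.0), ("US", 30.0, 1.0),
  ("FR", 0.0, 0.0), ("DE", 7.0, 1.0), ("US", 21.0, 0.0)
).toDF("Nationality", "Tenure", "label")

// StringIndexer attaches nominal (categorical) metadata to its output column.
val indexed = new StringIndexer()
  .setInputCol("Nationality")
  .setOutputCol("NationalityIndex")
  .fit(df)
  .transform(df)

// VectorAssembler keeps that metadata; 'Tenure' has none, so it stays numeric.
val assembled = new VectorAssembler()
  .setInputCols(Array("Tenure", "NationalityIndex"))
  .setOutputCol("indexedFeatures")
  .transform(indexed)

// Inspect the per-feature attribute types recorded on the vector column.
val attrTypes = AttributeGroup
  .fromStructField(assembled.schema("indexedFeatures"))
  .attributes.get
  .map(_.attrType.name)
println(attrTypes.mkString(", "))

// maxBins must be at least the number of categories of every categorical
// feature, so the default of 32 is too small for 50+ nationalities.
val model = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
  .setMaxBins(64)
  .fit(assembled)
```

With this layout no maxCategories threshold is involved, so the number of distinct 'Tenure' values no longer matters. If 'Nationality' is already encoded as numbers, the commenter's original route (VectorIndexer on a vector holding only that feature, then a second VectorAssembler to append 'Tenure') should apply in the same way.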
