
I have to use this code:

val dt = new DecisionTreeClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setImpurity(impurity).setMaxBins(maxBins).setMaxDepth(maxDepth);

I need to add categorical features information so that the decision tree doesn't treat the indexedCategoricalFeatures as numerical. I have this map:

val categoricalFeaturesInfo = Map(143 -> 126, 144 -> 5, 145 -> 216, 146 -> 100, 147 -> 14, 148 -> 8, 149 -> 19, 150 -> 7);

However, this map only works with the DecisionTree.trainClassifier method. I can't use that method because it accepts different arguments from the ones I have. I would really like to be able to use DecisionTreeClassifier with the categorical features treated properly.

Thank you for your help!


1 Answer


You're mixing two different APIs that take different approaches to categorical data:

  • The RDD-based o.a.s.mllib API, which receives the required metadata through the categoricalFeaturesInfo map.
  • The Dataset (DataFrame) based o.a.s.ml API, which uses column metadata to determine variable types. If you use ML transformers correctly to create the features, this should be handled automatically for you; otherwise you'll have to provide the metadata manually.
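
For the manual route, a minimal sketch of attaching that metadata could look like the following. It assumes the features column is a vector of known length; the object name CategoricalMetadata, the helper names, and the numFeatures parameter are hypothetical and not part of Spark or of the question. Indices that are absent from the map are assumed to be continuous.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute, NumericAttribute}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper object, for illustration only.
object CategoricalMetadata {

  // Build one attribute per feature slot: nominal (categorical) for indices
  // listed in categoricalFeaturesInfo, numeric for every other index.
  def featureAttributes(
      name: String,
      numFeatures: Int,
      categoricalFeaturesInfo: Map[Int, Int]): AttributeGroup = {
    val attrs = Array.tabulate[Attribute](numFeatures) { i =>
      categoricalFeaturesInfo.get(i) match {
        case Some(arity) => NominalAttribute.defaultAttr.withIndex(i).withNumValues(arity)
        case None        => NumericAttribute.defaultAttr.withIndex(i)
      }
    }
    new AttributeGroup(name, attrs)
  }

  // Re-alias the features column with the metadata attached, so that
  // o.a.s.ml estimators can read the variable types from the schema.
  def withFeatureMetadata(
      df: DataFrame,
      featuresCol: String,
      numFeatures: Int,
      categoricalFeaturesInfo: Map[Int, Int]): DataFrame = {
    val meta = featureAttributes(featuresCol, numFeatures, categoricalFeaturesInfo).toMetadata()
    df.withColumn(featuresCol, col(featuresCol).as(featuresCol, meta))
  }
}
```

With the map from the question, the call would be something like CategoricalMetadata.withFeatureMetadata(df, "indexedFeatures", numFeatures, categoricalFeaturesInfo), where df and numFeatures are placeholders for your own DataFrame and vector length, and the result is what you pass to dt.fit. Note that maxBins has to be at least as large as the largest number of categories (216 here).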
zero323
  • Thank you. However, I don't see which ML transformer would handle this automatically for me. If you can point me to it, I will accept your response and +1 it as due! – user3553070 Aug 11 '16 at 17:00
  • This depends solely on your pipeline. Pretty much every ML transformer has specific semantics and will set some type of metadata on the schema. You just have to be aware of that and keep in mind that the pipeline should reflect the semantics of your data. The details are far too broad for SO IMHO. You can find some information [here](https://github.com/awesome-spark/spark-gotchas/blob/master/06_data_preparation.md) (disclaimer: I am a co-author), but this is only the tip of the iceberg. – zero323 Aug 11 '16 at 17:22
  • Sorry to be asking this here, but do you have any idea how to display the decision tree with string categories instead of indexed categories? Thank you. – user3553070 Aug 17 '16 at 19:50
  • If you're looking for an out-of-the-box solution, AFAIK there is none at the moment. You can check http://stackoverflow.com/a/37311078/1560062 for an example of how to traverse the tree and adjust it with metadata. – zero323 Aug 17 '16 at 21:09