I am working with Apache Spark MLlib version 2.11 in Java. I need to pass both categorical and numerical features (strings and numbers) to the RandomForestClassifier.
What is the best API to use for such a case? An example would be very helpful.
Edit
I tried to use the VectorIndexer, but it accepts only numbers, and I couldn't understand how to integrate OneHotEncoder with it. Also, it's not clear how to tell it which features are categorical and which are numerical. Where do I need to set all possible categories?
Here is some code I tried:
StructType schema = DataTypes.createStructType(new StructField[] {
    new StructField("label", DataTypes.StringType, false, Metadata.empty()),
    new StructField("features", new ArrayType(DataTypes.StringType, false), false,
        Metadata.empty()),
});
JavaRDD<Row> rowRDD = trainingData.map(record -> {
    List<String> values = new ArrayList<>();
    for (String field : fields) {
        values.add(record.get(field));
    }
    return RowFactory.create(record.get(Constants.GROUND_TRUTH), values.toArray(new String[0]));
});
Dataset<Row> trainingDataDataframe = spark.createDataFrame(rowRDD, schema);
StringIndexerModel labelIndexer = new StringIndexer()
    .setInputCol("label")
    .setOutputCol("indexedLabel")
    .fit(trainingDataDataframe);
OneHotEncoder encoder = new OneHotEncoder()
    .setInputCol("features")
    .setOutputCol("featuresVec");
Dataset<Row> encoded = encoder.transform(trainingDataDataframe);
VectorIndexerModel featureIndexer = new VectorIndexer()
    .setInputCol("featuresVec")
    .setOutputCol("indexedFeatures")
    .setMaxCategories(maxCategories)
    .fit(encoded);
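// Alternative to the VectorIndexer above (only one of the two featureIndexer definitions is used at a time)
StringIndexerModel featureIndexer = new StringIndexer()
    .setInputCol("features")
    .setOutputCol("indexedFeatures")
    .fit(encoded);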
RandomForestClassifier rf = new RandomForestClassifier()
    .setNumTrees(numTrees)
    .setFeatureSubsetStrategy(featureSubsetStrategy)
    .setImpurity(impurity)
    .setMaxDepth(maxDepth)
    .setMaxBins(maxBins)
    .setSeed(seed)
    .setLabelCol("indexedLabel")
    .setFeaturesCol("indexedFeatures");
IndexToString labelConverter = new IndexToString()
    .setInputCol("prediction")
    .setOutputCol("predictedLabel")
    .setLabels(labelIndexer.labels());
Pipeline pipeline = new Pipeline()
    .setStages(new PipelineStage[] {labelIndexer, featureIndexer, rf, labelConverter});
PipelineModel model = pipeline.fit(encoded);
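To make clearer what I am trying to achieve, here is a minimal Spark-free sketch of the preprocessing I have in mind. Each categorical (string) column gets a category-to-index mapping learned from the data, numerical columns are kept as they are, and everything ends up in one numeric vector per row. The class and method names (MixedFeatureSketch, fitIndex, assemble) are hypothetical and only for illustration. My assumption is that in Spark ML this roughly corresponds to one StringIndexer per categorical column followed by a VectorAssembler, but I have not confirmed that this is the recommended way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is not Spark code.
public class MixedFeatureSketch {

    // Learns a category -> index mapping from the column values themselves.
    // The most frequent category gets index 0; ties are broken alphabetically.
    public static Map<String, Integer> fitIndex(List<String> column) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : column) {
            counts.merge(value, 1, Integer::sum);
        }
        // Keys come out of the TreeMap alphabetically; the stable sort keeps that order for ties.
        List<String> ordered = new ArrayList<>(counts.keySet());
        ordered.sort((a, b) -> counts.get(b) - counts.get(a));
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String value : ordered) {
            index.put(value, index.size());
        }
        return index;
    }

    // Builds one numeric vector: indexed categorical values first, then the raw numerical values.
    public static double[] assemble(String[] categorical, double[] numeric,
                                    List<Map<String, Integer>> indexes) {
        double[] out = new double[categorical.length + numeric.length];
        for (int i = 0; i < categorical.length; i++) {
            Integer idx = indexes.get(i).get(categorical[i]);
            if (idx == null) {
                throw new IllegalArgumentException("Unseen category: " + categorical[i]);
            }
            out[i] = idx;
        }
        System.arraycopy(numeric, 0, out, categorical.length, numeric.length);
        return out;
    }

    public static void main(String[] args) {
        List<String> colors = Arrays.asList("red", "blue", "red", "green");
        Map<String, Integer> colorIndex = fitIndex(colors);
        double[] row = assemble(new String[] {"blue"}, new double[] {3.5},
                Collections.singletonList(colorIndex));
        // prints: {red=0, blue=1, green=2} [1.0, 3.5]
        System.out.println(colorIndex + " " + Arrays.toString(row));
    }
}
```

In this sketch the set of possible categories is never declared up front; it is derived from the training column, and a value that was not seen during fitting raises an error. What I am missing is the equivalent in the Spark pipeline, including how the classifier learns that the first positions of the vector are categorical rather than continuous.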