Categorical and numerical features in Spark MLlib (Java)

Question

I am working with Apache Spark MLlib version 2.11 in Java. I need to pass both categorical and numerical features (strings and numbers) to the RandomForestClassifier.

What is the best API to use for such a case? An example would be very helpful.

Edit

I tried to use the VectorIndexer, but it accepts only numbers, and I couldn't understand how to integrate OneHotEncoder with it. Also, it's not clear how to tell it which features are categorical and which are numerical. Where do I need to set all possible categories?

Here is some code I tried:

StructType schema = DataTypes.createStructType(new StructField[] {
        new StructField("label", DataTypes.StringType, false, Metadata.empty()),
        new StructField("features", new ArrayType(DataTypes.StringType, false), false,
                Metadata.empty()),
});

JavaRDD<Row> rowRDD = trainingData.map(record -> {
    List<String> values = new ArrayList<>();
    for (String field : fields) {
        values.add(record.get(field));
    }
    return RowFactory.create(record.get(Constants.GROUND_TRUTH), values.toArray(new String[0]));
});

Dataset<Row> trainingDataDataframe = spark.createDataFrame(rowRDD, schema);

StringIndexerModel labelIndexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol("indexedLabel")
        .fit(trainingDataDataframe);

OneHotEncoder encoder = new OneHotEncoder()
        .setInputCol("features")
        .setOutputCol("featuresVec");
Dataset<Row> encoded = encoder.transform(trainingDataDataframe);

VectorIndexerModel featureIndexer = new VectorIndexer()
        .setInputCol("featuresVec")
        .setOutputCol("indexedFeatures")
        .setMaxCategories(maxCategories)
        .fit(encoded);

StringIndexerModel featureIndexer = new StringIndexer()
        .setInputCol("features")
        .setOutputCol("indexedFeatures")
        .fit(encoded);

RandomForestClassifier rf = new RandomForestClassifier()
        .setNumTrees(numTrees)
        .setFeatureSubsetStrategy(featureSubsetStrategy)
        .setImpurity(impurity)
        .setMaxDepth(maxDepth)
        .setMaxBins(maxBins)
        .setSeed(seed)
        .setLabelCol("indexedLabel")
        .setFeaturesCol("indexedFeatures");

IndexToString labelConverter = new IndexToString()
        .setInputCol("prediction")
        .setOutputCol("predictedLabel")
        .setLabels(labelIndexer.labels());

Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[] {labelIndexer, featureIndexer, rf, labelConverter});

PipelineModel model = pipeline.fit(encoded);

That answer is in scala but you've asked for an API. And the API is the same. — eliasah, May 30 '18 at 09:21
The question referenced is about spark-ml, and not spark-mllib. Furthermore, it shows how to handle a feature as categorical, and not how to use both categorical and numerical features together. Please reconsider un-marking the question as a duplicate. — Alex Bousso, May 30 '18 at 09:45

score 2 · Answer 1 · answered May 31 '18 at 08:35

A Random Forest, like a Decision Tree, does not need one-hot encoding to handle categorical features. It is one of the few techniques that can handle categorical features natively, that is, without first transforming them into binary features, which is the purpose of one-hot encoding.

The easiest way to deal with continuous and categorical features at the same time is to set the maxCategories parameter of the VectorIndexer properly. When you train the pipeline, the distinct values of each feature are counted, and features with at most maxCategories distinct values in the training data are treated as categorical.
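To make the rule concrete, here is a minimal plain-Java sketch of the decision described above. It is not Spark's implementation, only an illustration of the documented behaviour that a feature with at most maxCategories distinct values is declared categorical. The class name, method name, and sample data are made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MaxCategoriesSketch {

    // For each column, report whether it would be declared categorical:
    // true when the column has at most maxCategories distinct values.
    // Illustration only; this is not how Spark implements VectorIndexer.
    public static boolean[] categoricalColumns(double[][] rows, int maxCategories) {
        int numCols = rows.length == 0 ? 0 : rows[0].length;
        boolean[] categorical = new boolean[numCols];
        for (int c = 0; c < numCols; c++) {
            Set<Double> distinct = new HashSet<>();
            for (double[] row : rows) {
                distinct.add(row[c]);
            }
            categorical[c] = distinct.size() <= maxCategories;
        }
        return categorical;
    }

    public static void main(String[] args) {
        // Hypothetical data: column 0 is a category index,
        // column 1 is a continuous measurement,
        // column 2 is numeric but happens to take only two values.
        double[][] rows = {
            {0.0, 1.5, 3.0},
            {1.0, 2.7, 3.0},
            {2.0, 9.1, 4.0},
            {0.0, 4.4, 4.0},
        };
        // prints [true, false, true]
        System.out.println(Arrays.toString(categoricalColumns(rows, 3)));
    }
}
```

Note that column 2 is flagged as categorical even though it may be a genuine numeric feature. A purely count-based threshold cannot tell the two cases apart, so maxCategories has to be chosen with the actual data in mind.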

You can check whether a feature is treated as categorical by printing the tree or forest with toDebugString. If it is categorical, you will see something like if feature0 in {0,1,2} instead of the usual <= comparison.

Thanks for your answer, but maxCategories is a bit problematic: I may have more categories in a categorical feature than distinct values in a numerical feature, in which case the numerical features will be treated as categorical. I actually succeeded in building the model; please refer to the answer I posted. — Alex Bousso, Jun 04 '18 at 14:59

score 1 · Accepted Answer · answered Jun 04 '18 at 14:56

I found a solution to the issue. I upgraded Spark MLlib to version 2.3.0, which includes a class named OneHotEncoderEstimator. It takes all the categorical columns (as Doubles) as input and outputs the corresponding vectors.

Then I used the VectorAssembler class to unify all the features (numerical and categorical) into one vector, which I sent to the RandomForestClassifier.
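A minimal sketch of how these stages might be wired together, assuming Spark 2.3.x on the classpath. The class name, the helper method, and the generated column names (the "_idx" and "_vec" suffixes, "indexedLabel", "features") are hypothetical choices for this example and not taken from the original code. It also assumes at least one categorical column, and that the numerical columns are already stored as numeric types rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.RandomForestClassifier;
import org.apache.spark.ml.feature.OneHotEncoderEstimator;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;

public class MixedFeaturePipeline {

    // Builds an unfitted pipeline: index the label, index and one-hot encode
    // each categorical column, assemble everything into one vector, classify.
    public static Pipeline build(String[] categoricalCols, String[] numericCols, String labelCol) {
        List<PipelineStage> stages = new ArrayList<>();

        // String label -> numeric label index.
        stages.add(new StringIndexer().setInputCol(labelCol).setOutputCol("indexedLabel"));

        // One StringIndexer per categorical column: string -> category index (double).
        String[] indexedCols = new String[categoricalCols.length];
        String[] encodedCols = new String[categoricalCols.length];
        for (int i = 0; i < categoricalCols.length; i++) {
            indexedCols[i] = categoricalCols[i] + "_idx"; // hypothetical naming
            encodedCols[i] = categoricalCols[i] + "_vec"; // hypothetical naming
            stages.add(new StringIndexer()
                    .setInputCol(categoricalCols[i])
                    .setOutputCol(indexedCols[i]));
        }

        // A single estimator encodes all indexed categorical columns at once.
        stages.add(new OneHotEncoderEstimator()
                .setInputCols(indexedCols)
                .setOutputCols(encodedCols));

        // Encoded categorical vectors followed by the raw numerical columns.
        String[] assemblerInputs = new String[encodedCols.length + numericCols.length];
        System.arraycopy(encodedCols, 0, assemblerInputs, 0, encodedCols.length);
        System.arraycopy(numericCols, 0, assemblerInputs, encodedCols.length, numericCols.length);
        stages.add(new VectorAssembler()
                .setInputCols(assemblerInputs)
                .setOutputCol("features"));

        stages.add(new RandomForestClassifier()
                .setLabelCol("indexedLabel")
                .setFeaturesCol("features"));

        return new Pipeline().setStages(stages.toArray(new PipelineStage[0]));
    }
}
```

Calling build(new String[] {"color"}, new String[] {"width"}, "label").fit(trainingDataDataframe) would then train the whole chain in one step, given a DataFrame with those (hypothetical) columns. In Spark 3.x the same estimator is named OneHotEncoder, so the import and class name would need adjusting there.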
