I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model must be put into a pipeline; the official pipeline demo uses LogisticRegression as the base model, which can be created with `new`. However, the RandomForest model cannot be instantiated by client code, so it seems impossible to use RandomForest in the pipeline API. I don't want to reinvent the wheel, so can anybody give some advice?
Thanks
1 Answer
> However, the RandomForest model cannot be instantiated by client code, so it seems impossible to use RandomForest in the pipeline API.

Well, that is true, but you are simply trying to use the wrong class. Instead of `mllib.tree.RandomForest` you should use `ml.classification.RandomForestClassifier`. Here is an example based on the one from the MLlib docs.
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._  // assumes a SQLContext in scope, as in spark-shell

case class Record(category: String, features: Vector)

// Load the data as an RDD[LabeledPoint] and split it into train and test sets.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

// The ml API works on DataFrames, so convert the RDDs first.
val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF

// Index the string labels; this also attaches the label metadata
// (the number of classes) that RandomForestClassifier requires.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val rf = new RandomForestClassifier()
  .setNumTrees(3)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)

val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))

val model = pipeline.fit(trainDF)
model.transform(testDF)
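Since the question is specifically about grid search and cross-validation, the same pipeline can be dropped straight into a `CrossValidator`. A minimal sketch, assuming the `rf` and `pipeline` defined above; the grid values are arbitrary, and `BinaryClassificationEvaluator` assumes a Spark version where RandomForestClassificationModel emits a `rawPrediction` column (1.5+):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Candidate parameter combinations to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(3, 10, 30))
  .addGrid(rf.maxDepth, Array(4, 8))
  .build()

// 3-fold cross-validation over the whole pipeline, so the
// StringIndexer is refit on each training fold as well.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainDF)
cvModel.transform(testDF)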
There is one thing I couldn't figure out here. As far as I can tell, it should be possible to use the labels extracted from the LabeledPoints directly, but for some reason it doesn't work and `pipeline.fit` raises an IllegalArgumentException:

RandomForestClassifier was given input with invalid label column label, without the number of classes specified.

Hence the ugly trick with `StringIndexer`. After applying it we get the required attributes (`{"vals":["1.0","0.0"],"type":"nominal","name":"label"}`), but some classes in `ml` seem to work just fine without it.
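The exception suggests the real issue is missing ML attribute metadata on the label column, so an alternative to the `StringIndexer` trick would be to attach that metadata by hand. A sketch, assuming a DataFrame `df` with a numeric `label` column built directly from the LabeledPoints (the names and label values here are illustrative):

import org.apache.spark.ml.attribute.NominalAttribute

// Nominal metadata declaring the distinct label values; this is what
// RandomForestClassifier reads to find the number of classes.
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata()

// Rebind the label column with the metadata attached.
val withMeta = df.withColumn("label", df("label").as("label", labelMeta))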

- RandomForestClassifier is a new feature in Spark v1.4, but my environment is currently 1.3; I have to wait until the end of this month for my Spark cluster to be updated to 1.4 :-( – bourneli Aug 20 '15 at 09:13
- With RandomForest I've got the output, but I can't figure out how to read it: `(1.0 -> prediction=0.0, prob=[0.90,0.08])` `(0.0 -> prediction=1.0, prob=[0.01,0.88])` It looks like the prediction here is just the index of the highest probability value, which means I have to look at my classes to figure out the actual class. The problem is that without the "ugly hack" it's not possible to know these classes. Is that how it's supposed to work, or did I miss anything? – evgenii Dec 04 '15 at 22:47
- It is expected. ML expects 0-based labels (0.0, 1.0, 2.0), and the predictions and probabilities reflect that. – zero323 Dec 05 '15 at 07:52
- 0-based is clear. The problem is different. I have 10 classes, from 0.0, 1.0, 2.0, up to 9.0. RandomForest gives me predictions like this: `(4.0 -> prediction=7.0)`, `(9.0 -> prediction=3.0)`, `(2.0 -> prediction=4.0)`, `(8.0 -> prediction=8.0)`. According to the docs, prediction is the "predicted label", which doesn't look like what I get. But! Taking the map of classes I got from StringIndexer (`{"ml_attr":{"vals":["1.0","7.0","3.0","9.0","2.0","6.0","0.0","4.0","8.0","5.0"],"type":"nominal","name":"indexedLabel"}}`) and applying "prediction" as an index into that map, I get correct predictions! – evgenii Dec 05 '15 at 10:11
- Yes, that's correct. I am afraid "predicted label" means the output of `StringIndexer`, not the original string, and this is correct / expected behavior. In 1.6.0 there is an `IndexToString` transformer which can be used for the inverse mapping. – zero323 Dec 05 '15 at 10:18
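
A minimal sketch of the `IndexToString` approach mentioned in the last comment, assuming the fitted `model` and `testDF` from the answer and Spark 1.6.0+ (the stage index and column names are illustrative):

import org.apache.spark.ml.feature.{IndexToString, StringIndexerModel}

// Pull the fitted StringIndexerModel out of the pipeline to recover
// the index -> original-label mapping.
val labels = model.stages(0).asInstanceOf[StringIndexerModel].labels

// Map the numeric prediction column back to the original labels.
val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labels)

converter.transform(model.transform(testDF))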