I want to tune my model with grid search and cross-validation in Spark. In Spark, the base model must be put into a pipeline; the official pipeline demo uses LogisticRegression as the base model, which can be created with `new`. However, the RandomForest model cannot be instantiated by client code, so it seems impossible to use RandomForest in the pipeline API. I don't want to reinvent the wheel, so can anybody give some advice?
Thanks
1 Answer
> However, the RandomForest model cannot be instantiated by client code, so it seems impossible to use RandomForest in the pipeline API.

Well, that is true, but you are simply trying to use the wrong class. Instead of `mllib.tree.RandomForest` you should use `ml.classification.RandomForestClassifier`. Here is an example based on the one from the MLlib docs.
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._  // assumes a SQLContext in scope, as in spark-shell

case class Record(category: String, features: Vector)

// Load the data as an RDD[LabeledPoint] and split it into train and test sets.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

// The ml API works on DataFrames, so convert the RDDs first.
val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF

// Index the string labels; this also attaches the label metadata
// (the number of classes) that RandomForestClassifier requires.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val rf = new RandomForestClassifier()
  .setNumTrees(3)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)

val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))

val model = pipeline.fit(trainDF)
model.transform(testDF)
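Since the question is specifically about grid search and cross-validation, the same pipeline can be dropped straight into a `CrossValidator`. A minimal sketch, assuming the `rf` and `pipeline` defined above; the grid values are arbitrary, and `BinaryClassificationEvaluator` assumes a Spark version where RandomForestClassificationModel emits a `rawPrediction` column (1.5+):

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Candidate parameter combinations to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(3, 10, 30))
  .addGrid(rf.maxDepth, Array(4, 8))
  .build()

// 3-fold cross-validation over the whole pipeline, so the
// StringIndexer is refit on each training fold as well.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(trainDF)
cvModel.transform(testDF)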
There is one thing I couldn't figure out here. As far as I can tell, it should be possible to use the labels extracted from the LabeledPoints directly, but for some reason it doesn't work and `pipeline.fit` raises an IllegalArgumentException:

RandomForestClassifier was given input with invalid label column label, without the number of classes specified.

Hence the ugly trick with `StringIndexer`. After applying it we get the required attributes (`{"vals":["1.0","0.0"],"type":"nominal","name":"label"}`), but some classes in `ml` seem to work just fine without it.
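The exception suggests the real issue is missing ML attribute metadata on the label column, so an alternative to the `StringIndexer` trick would be to attach that metadata by hand. A sketch, assuming a DataFrame `df` with a numeric `label` column built directly from the LabeledPoints (the names and label values here are illustrative):

import org.apache.spark.ml.attribute.NominalAttribute

// Nominal metadata declaring the distinct label values; this is what
// RandomForestClassifier reads to find the number of classes.
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata()

// Rebind the label column with the metadata attached.
val withMeta = df.withColumn("label", df("label").as("label", labelMeta))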

- RandomForestClassifier is a new feature in Spark v1.4, but my environment is currently 1.3; I have to wait until the end of this month for my Spark cluster to be updated to 1.4 :-( – bourneli Aug 20 '15 at 09:13
- With RandomForest I've got the output, but I can't figure out how to read it: `(1.0 -> prediction=0.0, prob=[0.90,0.08])` `(0.0 -> prediction=1.0, prob=[0.01,0.88])` It looks like the prediction here is just the index of the highest probability value, which means I have to look at my classes to figure out the actual class. The problem is that without the "ugly hack" it's not possible to know these classes. Is that how it's supposed to work, or did I miss anything? – evgenii Dec 04 '15 at 22:47
- It is expected. ML expects 0-based labels (0.0, 1.0, 2.0), and the predictions and probabilities reflect that. – zero323 Dec 05 '15 at 07:52
- 0-based is clear. The problem is different. I have 10 classes, from 0.0, 1.0, 2.0, up to 9.0. RandomForest gives me predictions like this: `(4.0 -> prediction=7.0)`, `(9.0 -> prediction=3.0)`, `(2.0 -> prediction=4.0)`, `(8.0 -> prediction=8.0)`. According to the docs, prediction is the "predicted label", which doesn't look like what I get. But! Taking the map of classes I got from StringIndexer (`{"ml_attr":{"vals":["1.0","7.0","3.0","9.0","2.0","6.0","0.0","4.0","8.0","5.0"],"type":"nominal","name":"indexedLabel"}}`) and applying "prediction" as an index into that map, I get correct predictions! – evgenii Dec 05 '15 at 10:11
- Yes, that's correct. I am afraid "predicted label" means the output of `StringIndexer`, not the original string, and this is correct / expected behavior. In 1.6.0 there is an `IndexToString` transformer which can be used for the inverse mapping. – zero323 Dec 05 '15 at 10:18
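
A minimal sketch of the `IndexToString` approach mentioned in the last comment, assuming the fitted `model` and `testDF` from the answer and Spark 1.6.0+ (the stage index and column names are illustrative):

import org.apache.spark.ml.feature.{IndexToString, StringIndexerModel}

// Pull the fitted StringIndexerModel out of the pipeline to recover
// the index -> original-label mapping.
val labels = model.stages(0).asInstanceOf[StringIndexerModel].labels

// Map the numeric prediction column back to the original labels.
val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labels)

converter.transform(model.transform(testDF))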