I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same or do I have to perform cross validation manually?
Asked
Active
Viewed 2.1k times
2 Answers
39
ML provides CrossValidator
class which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed you can add cross-validation as follows:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// [label: double, features: vector]
trainingData org.apache.spark.sql.DataFrame = ???
val nFolds: Int = ???
val numTrees: Int = ???
val metric: String = ???
val rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setNumTrees(numTrees)
val pipeline = new Pipeline().setStages(Array(rf))
val paramGrid = new ParamGridBuilder().build() // No parameter search
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
// "f1" (default), "weightedPrecision", "weightedRecall", "accuracy"
.setMetricName(metric)
val cv = new CrossValidator()
// ml.Pipeline with ml.classification.RandomForestClassifier
.setEstimator(pipeline)
// ml.evaluation.MulticlassClassificationEvaluator
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(nFolds)
val model = cv.fit(trainingData) // trainingData: DataFrame
Using PySpark:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
trainingData = ... # DataFrame[label: double, features: vector]
numFolds = ... # Integer
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
evaluator = MulticlassClassificationEvaluator() # + other params as in Scala
pipeline = Pipeline(stages=[rf])
paramGrid = (ParamGridBuilder.
.addGrid(rf.numTrees, [3, 10])
.addGrid(...) # Add other parameters
.build())
crossval = CrossValidator(
estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
numFolds=numFolds)
model = crossval.fit(trainingData)
-
Are you sure that this works for leave-one-out? The kFold() call under the hood doesn't appear to deterministically return two folds length N-1 and 1. When I run the code above with a RegressionEvaluator and Lasso model I get: Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Nothing has been added to this summarizer. – paradiso Oct 01 '15 at 18:47
-
5No, I am pretty sure it doesn't. `MLUtils.kFold` is using `BernoulliCellSampler` to determine split. From the other hand cost of performing leave-one-out cross-validation in Spark is probably to high anyway to be make it feasible in practice. – zero323 Oct 02 '15 at 02:27
-
Hello @zero323, when you set a metric in your Evaluator object like .setMetricName("precision") . My question is, how can I get those metric calulated during training process? (Please refer this question: http://stackoverflow.com/questions/37778532/how-to-get-precision-recall-using-crossvalidator-for-training-naivebayes-model-u) – dbustosp Jun 13 '16 at 13:24
-
Hey @zero323 , is there a need to split data into training/testing when using cross validation? As CV trains and test over a number of folds, it should then give an accuracy of the average of the accuracy training/testing on the five folds? Or maybe I am way off. – other15 Jun 16 '16 at 15:05
-
@other15 Personally I would choose independent out-of-sample confirmation with CV as well. For obvious reason you can omit validation set though. – zero323 Jun 16 '16 at 20:00
-
AFAICT, you don't have access to the test metric on the test set, only the training set (for the best model). – Evan Zamir Sep 28 '16 at 19:37
-
1@zero323 I think you should change "precision" with "accuracy", according to https://issues.apache.org/jira/browse/SPARK-15771 – user299791 Feb 09 '17 at 21:45
-
@user299791 Thanks. Let's make it generic. – zero323 Feb 10 '17 at 00:14
-
@zero323 I would actually being able to understand better MulticlassClassificationEvaluator results, but it seems like there is not much examples around – user299791 Feb 10 '17 at 11:23
2
To build on zero323's great answer using Random Forest Classifier, here is a similar example for Random Forest Regressor:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.regression.RandomForestRegressor // CHANGED
import org.apache.spark.ml.evaluation.RegressionEvaluator // CHANGED
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
val numFolds = ??? // Integer
val data = ??? // DataFrame
// Training (80%) and test data (20%)
val Array(train, test) = data.randomSplit(Array(0.8,0.2))
val featuresCols = data.columns
val va = new VectorAssembler()
va.setInputCols(featuresCols)
va.setOutputCol("rawFeatures")
val vi = new VectorIndexer()
vi.setInputCol("rawFeatures")
vi.setOutputCol("features")
vi.setMaxCategories(5)
val regressor = new RandomForestRegressor()
regressor.setLabelCol("events")
val metric = "rmse"
val evaluator = new RegressionEvaluator()
.setLabelCol("events")
.setPredictionCol("prediction")
// "rmse" (default): root mean squared error
// "mse": mean squared error
// "r2": R2 metric
// "mae": mean absolute error
.setMetricName(metric)
val paramGrid = new ParamGridBuilder().build()
val cv = new CrossValidator()
.setEstimator(regressor)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(numFolds)
val model = cv.fit(train) // train: DataFrame
val predictions = model.transform(test)
predictions.show
val rmse = evaluator.evaluate(predictions)
println(rmse)
Evaluator metric source: https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.evaluation.RegressionEvaluator

Garren S
- 5,552
- 3
- 30
- 45