I use Spark MLlib to run an SVM classification on an RDD of LabeledPoints, and I want to cross-validate it. What is the best way to do that? Does anyone have example code? I found the CrossValidator class, but it relies on a DataFrame.

My aim is to obtain the F-score.

zero323
jp_

2 Answers

I faced the same issue for over a month until I realized that I had to use the ML API instead of the MLlib API (more about the differences between the two here). In that case, the SVM in the new API is LinearSVC:

from pyspark.ml.classification import RandomForestClassifier, LinearSVC
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, CrossValidatorModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# SVM
crossval = CrossValidator(estimator=LinearSVC(),
                          estimatorParamMaps=ParamGridBuilder().build(),
                          evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=5,
                          parallelism=4)

# Random Forest
crossval = CrossValidator(estimator=RandomForestClassifier(),
                          estimatorParamMaps=ParamGridBuilder().build(),
                          evaluator=MulticlassClassificationEvaluator(metricName='f1'),
                          numFolds=5,
                          parallelism=4)

In both cases you can then fit the model on a DataFrame (here `train_df` stands for your training DataFrame with `features` and `label` columns):

cross_model: CrossValidatorModel = crossval.fit(train_df)
Genarito

You can find a complete example on Spark's GitHub, though it uses logistic regression rather than SVM.

The best way is to convert your RDD into a DataFrame using the rdd.toDF() method.

Mateusz Dymczyk
    Thanks so far. In the example a LogisticRegression object is instantiated and inserted into the pipeline. I can't find any SVM to instantiate that fits into the pipeline, though. Which class should I use? – jp_ Mar 11 '16 at 15:25