
I am using Spark MLlib for my project, and I have used SVM, Decision Tree, and Random Forest. I split the dataset into training and testing sets (60% training, 40% testing) and got my results.

I want to repeat my work, but split the data using cross-validation instead of a percentage split, for SVM, DT, and RF.

How can I do that on Spark? I have found several examples for splitting that use logistic regression and a Pipeline, which do not work for SVM.

I need to split the data into 10 folds and then apply SVM, for now.

I also want to print the accuracy for each fold.

  • Did you check this answer https://stackoverflow.com/questions/32769573/how-to-cross-validate-randomforest-model? – Moustafa Mahmoud Jan 03 '19 at 11:44
  • 1
    Thanks. This example and lot of examples I've seen using Pipeline which work for Decision tree and RF. But not for SVM. How can I do cross validation for SVM? – user3069106 Jan 06 '19 at 07:02

1 Answer


You have to use the ML API instead of MLlib. In the former, the SVM model is called LinearSVC, and you can use it the following way:

from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

cv = CrossValidator(
    estimator=LinearSVC(),
    estimatorParamMaps=ParamGridBuilder().build(),
    evaluator=MulticlassClassificationEvaluator(metricName='accuracy'),
    numFolds=10
)

cv_model = cv.fit(dataset)
# avgMetrics[0] is the accuracy averaged over the 10 folds
# for the (single) parameter combination in the grid
print(f'Average accuracy across folds -> {cv_model.avgMetrics[0]}')

Now, if you want to get the metric for each fold, you need to write your own custom CrossValidator. This blog may help you!

Genarito