
I am using Spark MLlib for my project, and I have used SVM, Decision Tree, and Random Forest. I split the dataset into training and testing sets (60% training, 40% testing) and got my results.

I want to repeat my work, but split the data using cross-validation instead of a percentage split, for SVM, DT, and RF.

How can I do that on Spark? I have found several examples for splitting that use logistic regression and a Pipeline, which do not work for SVM.

I need to split the data into 10 folds and then apply SVM, for now.

I also want to print the accuracy for each fold.

  • Did you check this answer https://stackoverflow.com/questions/32769573/how-to-cross-validate-randomforest-model? – Moustafa Mahmoud Jan 03 '19 at 11:44
  • 1
    Thanks. This example and lot of examples I've seen using Pipeline which work for Decision tree and RF. But not for SVM. How can I do cross validation for SVM? – user3069106 Jan 06 '19 at 07:02

1 Answer


You have to use the ML API instead of MLlib. In the former, the SVM model is called LinearSVC, and you can use it the following way:

from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

cv = CrossValidator(
    estimator=LinearSVC(),
    estimatorParamMaps=ParamGridBuilder().build(),
    evaluator=MulticlassClassificationEvaluator(metricName='accuracy'),
    numFolds=10
)

cv_model = cv.fit(dataset)
# avgMetrics[0] is the accuracy averaged over the 10 folds
# for the (single) parameter combination in the grid
print(f'Average accuracy across folds -> {cv_model.avgMetrics[0]}')

Now, if you want to get the metric for each fold, you need to write your own custom CrossValidator. This blog may help you!

Genarito