Pyspark k-fold cross validation average RMSE

Question

I am running linear regression with k-fold cross-validation on a dataset using PySpark. At the moment I can only determine the RMSE of the best model, but I want the average RMSE across all the models evaluated during cross-validation. How do I get that?

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Reuse the active session (already defined as `spark` in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

file_name = '/tmp/user/userfile/LS.csv'
data = spark.read.options(header='true', inferSchema='true',
                          delimiter=',').csv(file_name)
data.cache()
features = ["x"]
lr_data = data.select(col("y").alias("label"), *features)
(training, test) = lr_data.randomSplit([.7, .3])

vectorAssembler = VectorAssembler(inputCols=features, outputCol="features")
training_ds = vectorAssembler.transform(training)
test_ds = vectorAssembler.transform(test)

lr = LinearRegression(maxIter=5, solver="l-bfgs")  # fit with the L-BFGS solver

modelEvaluator = RegressionEvaluator()  # default metric is RMSE

# Parentheses allow the chained calls to span several lines
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.01])
             .addGrid(lr.elasticNetParam, [0, 1])
             .build())

crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=modelEvaluator,
                          numFolds=2)

cvModel = crossval.fit(training_ds)

prediction = cvModel.transform(test_ds)

evaluator = RegressionEvaluator(labelCol="label",
                                predictionCol="prediction",
                                metricName="rmse")

rms = evaluator.evaluate(prediction)
print("Root Mean Squared Error (RMSE) on test data = %g" % rms)

I edited your tags. Please check what tags you're using. (ML is a family of programming languages.) — molbdnilo, Dec 16 '18 at 19:41

score 1 · Answer 1 · answered Dec 16 '18 at 17:56

You simply need to extract the other models from the CrossValidator. See:

Spark CrossValidatorModel access other models than the bestModel?

Then run RegressionEvaluator on each model and compute the average by hand.
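
For the numbers the question asks about, re-evaluating may not be necessary: the fitted CrossValidatorModel carries an avgMetrics list holding, for each parameter combination in the grid, the evaluator's metric averaged over the folds. With the question's modelEvaluator = RegressionEvaluator(), whose default metric is RMSE, that is the per-combination average RMSE. Below is a minimal sketch of turning that list into something readable. The helper names summarize_cv_metrics and overall_average are made up for illustration, and the Spark-specific lines are left as comments because they assume the cvModel and paramGrid from the question.

```python
from statistics import mean


def summarize_cv_metrics(param_maps, avg_metrics):
    """Pair each parameter combination with its fold-averaged metric.

    param_maps  -- list of dicts, e.g. the paramGrid built by ParamGridBuilder
    avg_metrics -- list of floats in the same order, e.g. cvModel.avgMetrics
    """
    rows = []
    for params, metric in zip(param_maps, avg_metrics):
        # pyspark Param keys expose .name; plain string keys fall back to str()
        readable = {getattr(p, "name", str(p)): v for p, v in params.items()}
        rows.append((readable, metric))
    return rows


def overall_average(avg_metrics):
    """Mean of the per-combination averages (hypothetical helper)."""
    return mean(avg_metrics)


# Assumed usage with the objects from the question (needs a fitted model):
# for params, rmse in summarize_cv_metrics(paramGrid, cvModel.avgMetrics):
#     print(params, "average RMSE over folds = %g" % rmse)
# print("mean over all combinations = %g" % overall_average(cvModel.avgMetrics))
```

Whether a single mean across different hyperparameter settings is meaningful depends on what you want to report; the per-combination averages are usually the more useful figure. If you do need the individual per-fold models, as the linked question discusses, CrossValidator accepts collectSubModels=True in recent Spark versions (2.3 and later, as far as I know), after which cvModel.subModels[fold][i] holds each fitted model.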

Cezary
