I am running linear regression with k-fold cross-validation on a dataset using PySpark. At the moment I can only determine the RMSE of the best model, but I want the average RMSE across all the models evaluated during cross-validation. How do I get that average?

from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql.functions import col  # needed for col("y") below

file_name = '/tmp/user/userfile/LS.csv'
data = spark.read.options(header='true', inferSchema='true',
                          delimiter=',').csv(file_name)
data.cache()
features = ["x"]
lr_data = data.select(col("y").alias("label"), *features)
(training, test) = lr_data.randomSplit([.7, .3])

vectorAssembler = VectorAssembler(inputCols=features, outputCol="features")
training_ds = vectorAssembler.transform(training)
test_ds = vectorAssembler.transform(test)

lr = LinearRegression(maxIter=5, solver="l-bfgs") # solver="l-bfgs" here

modelEvaluator = RegressionEvaluator()  # metricName defaults to "rmse"

paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.1, 0.01])
             .addGrid(lr.elasticNetParam, [0.0, 1.0])
             .build())

crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=modelEvaluator,
                          numFolds=2)

cvModel = crossval.fit(training_ds)

prediction = cvModel.transform(test_ds)

evaluator = RegressionEvaluator(labelCol="label",
                                predictionCol="prediction",
                                metricName="rmse")

rms = evaluator.evaluate(prediction)
print("Root Mean Squared Error (RMSE) on test data = %g" % rms)
molbdnilo

1 Answer

You simply need to extract the other models from the CrossValidatorModel, as discussed in:

Spark CrossValidatorModel access other models than the bestModel?

Then run RegressionEvaluator on each of them and compute the average by hand.
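As a possible shortcut, recent PySpark versions expose the fold-averaged metric for every parameter combination as `avgMetrics` on the fitted `CrossValidatorModel`, so the averaging may not need to be done by hand. The sketch below is only an illustration: `summarize_cv_metrics`, `cv_rmse_summary` and `_DemoParam` are hypothetical helper names, and the demo numbers are made up so the snippet runs without a Spark session. It assumes the evaluator passed to `CrossValidator` reports RMSE, which is the `RegressionEvaluator` default.

```python
from statistics import mean


def summarize_cv_metrics(param_maps, avg_metrics):
    """Pair each parameter map with its fold-averaged metric.

    param_maps:  list of dicts keyed by objects with a .name attribute
                 (pyspark Param objects have one).
    avg_metrics: list of floats, one per parameter map, in the same order.
    Returns (rows, overall_mean).
    """
    rows = [
        ({param.name: value for param, value in pm.items()}, metric)
        for pm, metric in zip(param_maps, avg_metrics)
    ]
    return rows, mean(avg_metrics)


def cv_rmse_summary(cv_model):
    """Summarize a fitted CrossValidatorModel (e.g. cvModel from the question).

    Assumption: the CrossValidator's evaluator used metricName="rmse",
    so cv_model.avgMetrics holds one fold-averaged RMSE per parameter map.
    """
    return summarize_cv_metrics(cv_model.getEstimatorParamMaps(),
                                cv_model.avgMetrics)


class _DemoParam:
    """Hypothetical stand-in for a pyspark Param, used only for the demo."""

    def __init__(self, name):
        self.name = name


# Made-up metrics for a 2x2 grid, shaped like the question's paramGrid.
reg, enet = _DemoParam("regParam"), _DemoParam("elasticNetParam")
demo_maps = [{reg: r, enet: e} for r in (0.1, 0.01) for e in (0.0, 1.0)]
demo_rows, demo_mean = summarize_cv_metrics(demo_maps, [1.0, 2.0, 3.0, 4.0])

for params, rmse in demo_rows:
    print(params, rmse)
print("mean over all parameter maps:", demo_mean)
```

With the question's code this would be called as `rows, avg_rmse = cv_rmse_summary(cvModel)`. Note that these values come from the validation folds inside `training_ds`, not from `test_ds`. If the individual per-fold models are needed, newer Spark versions (2.3+, as far as I know) accept `collectSubModels=True` on `CrossValidator` and expose them as `cvModel.subModels`, which can then be evaluated one by one as described above.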

Cezary