
I'm training a Random Forest with PySpark, and I want a CSV with the results for each point in the parameter grid. My code is:

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

estimator = RandomForestRegressor()
evaluator = RegressionEvaluator()
paramGrid = ParamGridBuilder().addGrid(estimator.numTrees, [2,3])\
                              .addGrid(estimator.maxDepth, [2,3])\
                              .addGrid(estimator.impurity, ['variance'])\
                              .addGrid(estimator.featureSubsetStrategy, ['sqrt'])\
                              .build()
pipeline = Pipeline(stages=[estimator])

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)

cvModel = crossval.fit(result)

So I want a csv:

numTrees | maxDepth | impurityMeasure
---------|----------|----------------
2        | 2        | 0.001
2        | 3        | 0.00023

etc.

What is the best way to do this?

2 Answers


You'll have to combine different bits of data:

  • The estimator ParamMaps, extracted with the getEstimatorParamMaps method.
  • The cross-validation metrics, available as the avgMetrics attribute.

First get names and values of all parameters declared in the map:

params = [{p.name: v for p, v in m.items()} for m in cvModel.getEstimatorParamMaps()]

Then zip with the metrics and convert to a data frame:

import pandas as pd

pd.DataFrame.from_dict([
    {cvModel.getEvaluator().getMetricName(): metric, **ps} 
    for ps, metric in zip(params, cvModel.avgMetrics)
])
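To get the CSV the question asks for, the resulting frame can be written out with pandas' to_csv. A minimal sketch with mocked params and metrics standing in for the values extracted from cvModel above (the column names here are illustrative, assuming RMSE as the metric):

```python
import pandas as pd

# Mocked stand-ins for cvModel.getEstimatorParamMaps() / cvModel.avgMetrics
params = [{"numTrees": 2, "maxDepth": 2}, {"numTrees": 2, "maxDepth": 3}]
metrics = [0.001, 0.00023]

# One row per grid point, metric column first
df = pd.DataFrame([{"rmse": metric, **ps} for ps, metric in zip(params, metrics)])
df.to_csv("grid_results.csv", index=False)
```

With the real cvModel, replace the mocked lists with the extraction shown above; the write-out step is identical.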
  • You must be using some outdated Python version. You can find methods which work with legacy versions [here](https://stackoverflow.com/q/38987/8371915). – Alper t. Turker Jul 08 '18 at 12:16
  • Is it possible to get the metric `accuracy` using the aforementioned procedure? – Simone Jul 12 '19 at 23:13

Really helpful answer here. Thought I would extend it for those using the alternative PySpark tuning class:

pyspark.ml.tuning.TrainValidationSplit

The validation metrics are retrieved via the validationMetrics attribute instead.

Replacing cvModel with tvsModel (an instance of pyspark.ml.tuning.TrainValidationSplitModel) the solution becomes:

params = [{p.name: v for p, v in m.items()} for m in tvsModel.getEstimatorParamMaps()]

pd.DataFrame.from_dict([
    {tvsModel.getEvaluator().getMetricName(): metric, **ps} 
    for ps, metric in zip(params, tvsModel.validationMetrics)
])
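Either way, once the results frame exists, the best grid point can be read off with idxmin or idxmax, depending on whether the metric is an error (lower is better) or a score; in PySpark, evaluator.isLargerBetter() tells you which direction applies. A small sketch on mocked data, assuming an RMSE-style error metric:

```python
import pandas as pd

# Mocked results frame standing in for the output built above
df = pd.DataFrame({
    "numTrees": [2, 2],
    "maxDepth": [2, 3],
    "rmse": [0.001, 0.00023],
})

# For an error metric such as RMSE, the best row has the minimum value
best = df.loc[df["rmse"].idxmin()]
```

For a score metric such as r2, use idxmax instead.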