
I'm training a Random Forest with PySpark, and I want a CSV with the results for each point in the parameter grid. My code is:

from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

estimator = RandomForestRegressor()
evaluator = RegressionEvaluator()
paramGrid = ParamGridBuilder().addGrid(estimator.numTrees, [2,3])\
                              .addGrid(estimator.maxDepth, [2,3])\
                              .addGrid(estimator.impurity, ['variance'])\
                              .addGrid(estimator.featureSubsetStrategy, ['sqrt'])\
                              .build()
pipeline = Pipeline(stages=[estimator])

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)

cvModel = crossval.fit(result)

So I want a csv:

numTrees | maxDepth | impurityMeasure
---------|----------|----------------
2        | 2        | 0.001
2        | 3        | 0.00023

etc.

What is the best way to do this?

2 Answers


You'll have to combine different bits of data:

  • The estimator ParamMaps, extracted with the getEstimatorParamMaps method.
  • The cross-validation metrics, available as the avgMetrics attribute.

First get names and values of all parameters declared in the map:

params = [{p.name: v for p, v in m.items()} for m in cvModel.getEstimatorParamMaps()]

Then zip with the metrics and convert to a data frame:

import pandas as pd

pd.DataFrame.from_dict([
    {cvModel.getEvaluator().getMetricName(): metric, **ps} 
    for ps, metric in zip(params, cvModel.avgMetrics)
])
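To get the CSV the question asks for, the resulting frame can be written out with pandas' to_csv. A minimal sketch with mocked params and metrics standing in for the values extracted from cvModel above (the column names here are illustrative, assuming RMSE as the metric):

```python
import pandas as pd

# Mocked stand-ins for cvModel.getEstimatorParamMaps() / cvModel.avgMetrics
params = [{"numTrees": 2, "maxDepth": 2}, {"numTrees": 2, "maxDepth": 3}]
metrics = [0.001, 0.00023]

# One row per grid point, metric column first
df = pd.DataFrame([{"rmse": metric, **ps} for ps, metric in zip(params, metrics)])
df.to_csv("grid_results.csv", index=False)
```

With the real cvModel, replace the mocked lists with the extraction shown above; the write-out step is identical.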
  • You must be using some outdated Python version. You can find methods which work with legacy versions [here](https://stackoverflow.com/q/38987/8371915). – Alper t. Turker Jul 08 '18 at 12:16
  • Is it possible to get the metric `accuracy` using the aforementioned procedure? – Simone Jul 12 '19 at 23:13

Really helpful answer here. Thought I would extend it for those using the alternative PySpark tuning class:

pyspark.ml.tuning.TrainValidationSplit

The validation metrics are retrieved via the validationMetrics attribute instead.

Replacing cvModel with tvsModel (an instance of pyspark.ml.tuning.TrainValidationSplitModel) the solution becomes:

params = [{p.name: v for p, v in m.items()} for m in tvsModel.getEstimatorParamMaps()]

pd.DataFrame.from_dict([
    {tvsModel.getEvaluator().getMetricName(): metric, **ps} 
    for ps, metric in zip(params, tvsModel.validationMetrics)
])
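Either way, once the results frame exists, the best grid point can be read off with idxmin or idxmax, depending on whether the metric is an error (lower is better) or a score; in PySpark, evaluator.isLargerBetter() tells you which direction applies. A small sketch on mocked data, assuming an RMSE-style error metric:

```python
import pandas as pd

# Mocked results frame standing in for the output built above
df = pd.DataFrame({
    "numTrees": [2, 2],
    "maxDepth": [2, 3],
    "rmse": [0.001, 0.00023],
})

# For an error metric such as RMSE, the best row has the minimum value
best = df.loc[df["rmse"].idxmin()]
```

For a score metric such as r2, use idxmax instead.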