38

I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)

Running this in the PySpark shell, I can get the logistic regression model's coefficients, but I can't seem to find the value of lr.regParam selected by the cross-validation procedure. Any ideas?

In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []
Paul
    Relevant question in Spark Scala API: http://stackoverflow.com/questions/31749593/how-to-extract-best-parameters-from-a-crossvalidatormodel – desertnaut Apr 20 '16 at 13:53
  • pyspark answer here: https://stackoverflow.com/questions/39529012/pyspark-get-all-parameters-of-models-created-with-paramgridbuilder – marilena.oita May 22 '17 at 14:23
  • Make sure to mark the answer (wernerchao's below worked for me). – Ross Lewis Aug 29 '17 at 16:19
  • I'll take your word for it, although this project is now a distant memory for me... – Paul Aug 29 '17 at 17:21

8 Answers

45

I ran into this problem as well. I found out that, for a reason I don't know, you need to go through the Java property. So just do this:

from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="mae")
lr = LinearRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [500]) \
                                .addGrid(lr.regParam, [0]) \
                                .addGrid(lr.elasticNetParam, [1]) \
                                .build()
lr_cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, \
                        evaluator=evaluator, numFolds=3)
lrModel = lr_cv.fit(your_training_set_here)
bestModel = lrModel.bestModel

Printing out the parameters you want:

>>> print 'Best Param (regParam): ', bestModel._java_obj.getRegParam()
0
>>> print 'Best Param (MaxIter): ', bestModel._java_obj.getMaxIter()
500
>>> print 'Best Param (elasticNetParam): ', bestModel._java_obj.getElasticNetParam()
1

This applies to other methods like extractParamMap() as well. They should fix this soon.
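For example, a minimal sketch of going through the Java object for extractParamMap() too (assuming the bestModel fitted above; the call returns a Java ParamMap rather than a Python dict):

java_param_map = bestModel._java_obj.extractParamMap()  # org.apache.spark.ml.param.ParamMap
print(java_param_map.toString())                        # rendered on the JVM side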

wernerchao
    Nice catch. Even better than a fix would be a feature like `cvModel.getAllTheBestModelsParametersPlease()` – George Fisher Aug 13 '17 at 13:15
    The answer didn't work for me. The correct one is: `modelOnly.bestModel.stages[-1]._java_obj.parent().getRegParam()`. Or if you don't use pipeline, just remove that `stages[-1]`. – Lynn Chen Nov 18 '18 at 12:45
11

This might not be as good as wernerchao's answer (because it's not convenient to store hyperparameters in variables), but you can quickly look at the best hyperparameters of a cross-validation model this way:

import numpy as np

cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)]
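If you'd rather see the result as a plain name/value dict, a small follow-up sketch (still assuming cvModel and np from the line above; best_params is just an illustrative name):

best_params = cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)]
print({param.name: value for param, value in best_params.items()})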
Pierre Gourseaud
4

Assuming cvModel3Day is your model name, the params can be extracted as shown below in the Spark Scala API:

val bestGbt = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel]

val params       = bestGbt.extractParamMap()
val depth        = bestGbt.getMaxDepth
val iter         = bestGbt.getMaxIter
val bins         = bestGbt.getMaxBins
val features     = bestGbt.getFeaturesCol
val step         = bestGbt.getStepSize
val samplingRate = bestGbt.getSubsamplingRate
Ashish Markanday
3

I also banged my head against this wall. Unfortunately, you can only get specific parameters for specific models. Happily, for logistic regression you can access the intercept and the coefficients; sadly, you cannot retrieve the regParam. This can be done in the following way:

best_lr = cvModel.bestModel

# get weights
best_lr.weights
>>> DenseVector([3.1573])

# or better
best_lr.coefficients
>>> DenseVector([3.1573])

# get intercept
best_lr.intercept
>>> -1.0829958115287153

As I wrote before, each model has a few parameters that can be extracted. Overall, getting the relevant models from a Pipeline (e.g. cvModel.bestModel when the CrossValidator runs over a Pipeline) can be done with:

best_pipeline = cvModel.bestModel
best_pipeline.stages
>>> [Tokenizer_4bc8884ad68b4297fd3c, CountVectorizer_411fbdeb4100c2bfe8ef, PCA_4c538d67e7b8f29ff8d0, LogisticRegression_4db49954edc7033edc76]

Each model is then obtained by simple list indexing:

best_lr = best_pipeline.stages[3]

Now the above can be applied.

elkbrs
2

There are two questions actually:

  • what are the aspects of the fitted model (like coefficients and intercepts)
  • what were the meta-parameters with which the bestModel was fitted.

Unfortunately, the Python API of the fitted estimators (the models) doesn't allow (easy) direct access to the parameters of the estimator, which makes it hard to answer the latter question.

However, there is a workaround using the Java API. For completeness, here is first a full setup of a cross-validated model:

%pyspark
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
logit = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[logit])
paramGrid = ParamGridBuilder() \
    .addGrid(logit.regParam, [0, 0.01, 0.05, 0.1, 0.5, 1]) \
    .addGrid(logit.elasticNetParam, [0.0, 0.1, 0.5, 0.8, 1]) \
    .build()
evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR')
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)
tuned_model = crossval.fit(train)
model = tuned_model.bestModel

One could then use the generic methods on the Java object to get the parameter values, without explicitly referring to methods like getRegParam():

java_model = model.stages[-1]._java_obj
{param.name: java_model.getOrDefault(java_model.getParam(param.name)) 
    for param in paramGrid[0]}

This executes the following steps:

  1. Get the fitted logit model as created by the estimator from the last stage of the best model: crossval.fit(..).bestModel.stages[-1]
  2. Get the internal java object from _java_obj
  3. Get all configured names from the paramGrid (which is a list of dictionaries). Only the first row is used, assuming it is an actual grid, i.e. each row contains the same keys. Otherwise you need to collect all names ever used in any row (see the sketch after this list).
  4. Get the corresponding Param<T> parameter identifier from the java object.
  5. Pass the Param<T> instance to the getOrDefault() function to get the actual value
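For the non-grid case mentioned in step 3, a hedged sketch that first collects every param name appearing in any row of paramGrid and then queries the same Java object:

java_model = model.stages[-1]._java_obj
all_names = {param.name for row in paramGrid for param in row}   # names from every row
print({name: java_model.getOrDefault(java_model.getParam(name)) for name in all_names})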
gerben
2

This took a couple minutes to decipher, but I figured it out.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# note: I've built out my model already; here I set up the ParamGridBuilder and cross-validator
paramGrid = ParamGridBuilder() \
                          .addGrid(hashingTF.numFeatures, [1000]) \
                          .addGrid(linearSVC.regParam, [0.1, 0.01]) \
                          .addGrid(linearSVC.maxIter, [10, 20, 30]) \
                          .build()
crossval = CrossValidator(estimator=pipeline,\
                          estimatorParamMaps=paramGrid,\
                          evaluator=MulticlassClassificationEvaluator(),\
                          numFolds=2)

cvModel = crossval.fit(train)

prediction = cvModel.transform(test)


bestModel = cvModel.bestModel

# applicable to your model to pull the list of all stages
for x in range(len(bestModel.stages)):
    print(bestModel.stages[x])

# get a stage's feature by calling the correct Transformer, then .get<parameter>()
print(bestModel.stages[3].getNumFeatures())
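As a hedged extra, borrowing the trick from wernerchao's answer and Lynn Chen's comment above, the tuned LinearSVC values can be read from the last stage's Java object via its parent estimator (assuming the fitted LinearSVC model is the pipeline's final stage):

print(bestModel.stages[-1]._java_obj.parent().getRegParam())
print(bestModel.stages[-1]._java_obj.parent().getMaxIter())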
2

(2020-05-21)

I know this is an old question, but I found a way to do this.
@Pierre Gourseaud gives us a nice way to get the hyperparameters of the best model:

import numpy as np

hyperparams = model_cv.getEstimatorParamMaps()[np.argmax(model_cv.avgMetrics)]
print(hyperparams)
[(Param(parent='ALS_cd65d45ab31c', name='implicitPrefs', doc='whether to use implicit preference'),
  True),
 (Param(parent='ALS_cd65d45ab31c', name='nonnegative', doc='whether to use nonnegative constraint for least squares'),
  True),
 (Param(parent='ALS_cd65d45ab31c', name='coldStartStrategy', doc="strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: 'nan', 'drop'."),
  'drop'),
 (Param(parent='ALS_cd65d45ab31c', name='rank', doc='rank of the factorization'),
  28),
 (Param(parent='ALS_cd65d45ab31c', name='maxIter', doc='max number of iterations (>= 0).'),
  20),
 (Param(parent='ALS_cd65d45ab31c', name='regParam', doc='regularization parameter (>= 0).'),
  0.01),
 (Param(parent='ALS_cd65d45ab31c', name='alpha', doc='alpha for implicit preference'),
  20.0)]

But this doesn't look very nice, so you can do this:

import re

hyper_list = []

for param, value in hyperparams.items():
    # Param's repr contains name='...'; pull out just the bare name
    hyper_name = re.search("name='(.+?)'", repr(param)).group(1)
    hyper_list.append({hyper_name: value})

print(hyper_list)
[{'implicitPrefs': True}, {'nonnegative': True}, {'coldStartStrategy': 'drop'}, {'rank': 28}, {'maxIter': 20}, {'regParam': 0.01}, {'alpha': 20.0}]

In my case I trained an ALS model, but it should work in your case too, because I also trained with cross-validation!

igorkf
0

If you want just the param names and their values:

{param.name: value for param, value in cvModel.bestModel.extractParamMap().items()}

and if you don't mind the descriptions etc., just use

cvModel.bestModel.extractParamMap()

The outputs will be:

    Out[58]: {'aggregationDepth': 2,
 'elasticNetParam': 0.0,
 'family': 'auto',
 'featuresCol': 'features',
 'fitIntercept': True,
 'labelCol': 'label',
 'maxBlockSizeInMB': 0.0,
 'maxIter': 10,
 'predictionCol': 'prediction',
 'probabilityCol': 'probability',
 'rawPredictionCol': 'rawPrediction',
 'regParam': 0.01,
 'standardization': True,
 'threshold': 0.5,
 'tol': 1e-06}

and

    Out[54]: {Param(parent='LogisticRegression_a6db1af69019', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
 Param(parent='LogisticRegression_a6db1af69019', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
 Param(parent='LogisticRegression_a6db1af69019', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto',
 Param(parent='LogisticRegression_a6db1af69019', name='featuresCol', doc='features column name.'): 'features',
 Param(parent='LogisticRegression_a6db1af69019', name='fitIntercept', doc='whether to fit an intercept term.'): True,
 Param(parent='LogisticRegression_a6db1af69019', name='labelCol', doc='label column name.'): 'label',
 Param(parent='LogisticRegression_a6db1af69019', name='maxBlockSizeInMB', doc='maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. Default 0.0 represents choosing optimal value, depends on specific algorithm. Must be >= 0.'): 0.0,
 Param(parent='LogisticRegression_a6db1af69019', name='maxIter', doc='max number of iterations (>= 0).'): 10,
 Param(parent='LogisticRegression_a6db1af69019', name='predictionCol', doc='prediction column name.'): 'prediction',
 Param(parent='LogisticRegression_a6db1af69019', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'probability',
 Param(parent='LogisticRegression_a6db1af69019', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction',
 Param(parent='LogisticRegression_a6db1af69019', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
 Param(parent='LogisticRegression_a6db1af69019', name='standardization', doc='whether to standardize the training features before fitting the model.'): True,
 Param(parent='LogisticRegression_a6db1af69019', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.5,
 Param(parent='LogisticRegression_a6db1af69019', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06}
Gozdi