Spark Cross Validation with Training, Testing and Validation sets

Question

I want to run two cross-validation processes in Spark using random splits, like this:

1. CV_global: split the data into a Training Set (90%) and a Testing Set (10%).

1.1. CV_grid: grid search on half of the Training Set, i.e. 45% of the data.

1.2. Fit Model: on the Training Set (90%) using the best settings from CV_grid.

1.3. Test Model: on the Testing Set (10%).

2. Report average metrics per fold (10-fold) and global metrics.

The problem is I only find examples using CV and Grid search on the whole training set.

How can I get the parameters of the best performing model from CV_grid?

How to do CV without grid search but get stats per fold? e.g. sklearn.cross_validation.cross_val_score

Actually `apache-spark` doesn't support that, you must do it by yourself by using `DataFrames` or `RDDs`. It is not so hard (I've already done it) — Alberto Bonsanto, Oct 29 '15 at 17:56
Well, I am using an ML pipeline end to end, so I was hoping not to have to break up the code for this. The main question is how to get the parameters of the best model from ParamGridBuilder. I am not very well versed in Spark — harvybcn, Oct 31 '15 at 00:18

score 0 · Answer 1 · answered Sep 11 '16 at 12:30

You have things like

crossval.setEstimatorParamMaps(paramGrid)

and then

cvModel = crossval.fit(trainingSetDF).bestModel

Individual models (at least some of them) also have methods such as explainParams(), which lists each parameter with its description and current value. It is available in Spark 1.6 (it may go back to 1.4.2, I'm not sure). Hope this helps.

score 0 · Answer 2 · answered Jun 01 '17 at 10:11

You have three questions in one. Here are the answers to each:

1. The problem is I only find examples using CV and Grid search on the whole training set.

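If you need just a portion of your training dataset, sample it at the desired fraction. The fraction is relative to the DataFrame being sampled and the resulting size is only approximate, so half of the 90% training set (about 45% of the data) is a fraction of 0.5. Keep the full training set under its own name for the final fit, e.g.

gridTraining = training.sample(false, 0.5, 78L)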

2. How can I get the parameters of the best performing model from CV_grid?

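crossValidatedModel.bestModel().extractParamMap()

From the returned ParamMap you can read the parameter names and then their values. If the best model is a PipelineModel, call extractParamMap() on each of its stages instead.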
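Another way to recover the winning grid settings is to pair each parameter map with its cross-validated average metric and pick the best pair. The following is a minimal sketch in plain Python, not Spark code: the function name `best_param_setting` and the sample numbers are made up for illustration. In PySpark the inputs would presumably be the fitted CrossValidatorModel's `avgMetrics` and the grid returned by `crossval.getEstimatorParamMaps()`, with `evaluator.isLargerBetter()` deciding the direction.

```python
def best_param_setting(avg_metrics, param_maps, larger_is_better=True):
    """Return (index, params, metric) for the best point of a parameter grid.

    avg_metrics[i] is assumed to be the average cross-validation metric
    obtained with param_maps[i], in the same order as the grid.
    """
    if len(avg_metrics) != len(param_maps):
        raise ValueError("expected one average metric per parameter map")
    pick = max if larger_is_better else min
    best_index = pick(range(len(avg_metrics)), key=lambda i: avg_metrics[i])
    return best_index, param_maps[best_index], avg_metrics[best_index]


# Stand-in values for illustration only (not results from this thread).
grid = [{"regParam": 0.01}, {"regParam": 0.1}, {"regParam": 1.0}]
metrics = [0.81, 0.86, 0.79]

index, params, metric = best_param_setting(metrics, grid)
print(index, params, metric)
```

For an error-style metric such as RMSE, pass `larger_is_better=False` so the smallest average wins.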

3. How to do CV without grid search but get stats per fold? e.g.

duplicate of How can I access computed metrics for each fold in a CrossValidatorModel

Take a look here: Spark CrossValidatorModel access other models than the bestModel?
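If you end up doing it by hand, as the comment under the question suggests, the per-fold loop itself is short. Here is a minimal sketch in plain Python of k-fold evaluation without any grid search, returning one score per fold. The names `k_fold_scores` and `mean_baseline_mae` are hypothetical, and the scorer is only a placeholder; on Spark the same loop would run over DataFrames (for example, folds produced with `randomSplit`), with a pipeline fit and an evaluator call in place of the placeholder.

```python
import random


def k_fold_scores(rows, k, fit_and_score, seed=42):
    """Return one score per fold; each row is held out exactly once."""
    shuffled = list(rows)
    if not 2 <= k <= len(shuffled):
        raise ValueError("k must be between 2 and the number of rows")
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th row starting at offset i.
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one being held out.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(fit_and_score(train, held_out))
    return scores


def mean_baseline_mae(train, test):
    """Hypothetical scorer: predict the training mean, report mean absolute error."""
    prediction = sum(train) / len(train)
    return sum(abs(y - prediction) for y in test) / len(test)


fold_scores = k_fold_scores(range(100), k=10, fit_and_score=mean_baseline_mae)
for fold, score in enumerate(fold_scores):
    print(f"fold {fold}: MAE = {score:.3f}")
print(f"mean over folds: {sum(fold_scores) / len(fold_scores):.3f}")
```

Returning the raw list keeps the per-fold numbers available, so the average (or any other summary) can be computed afterwards, much like `cross_val_score` in scikit-learn.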
