How to save the model after doing pipeline fit?

Question

I wrote this code in Spark ML:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.Pipeline

val lr = new LogisticRegression()
val pipeline = new Pipeline()
                .setStages(Array(fooIndexer, fooHotEncoder, assembler, lr))
val model = pipeline.fit(training)

This code takes a long time to run. After running pipeline.fit, can I save the fitted model to HDFS so that I don't have to refit it every time?

Edit: Also, how do I load it back from HDFS when I need to call transform on the model to make predictions?

score 5 · Accepted Answer · answered May 29 '18 at 17:44

Straight from the official documentation - saving:

// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

and loading:

import org.apache.spark.ml.PipelineModel

// And load it back in during production
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
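To tie the two snippets back to the question, here is a minimal end-to-end sketch of fit, save, load, and transform. The object name `SaveLoadPipelineSketch`, the helper `roundTrip`, the toy columns `x1` and `x2`, and the temporary directory are assumptions made for illustration, not part of the original code. The sketch writes to a local temp directory so that it runs without a cluster. For HDFS, a fully qualified URI such as `hdfs://<namenode>/<path>` should work in its place, and a path with no scheme is resolved against the cluster's default filesystem.

```scala
import java.nio.file.Files

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical names throughout; a sketch, not the asker's actual pipeline.
object SaveLoadPipelineSketch {

  // Fits a tiny pipeline, saves it to modelPath, reloads it, and scores the data.
  def roundTrip(spark: SparkSession, modelPath: String): DataFrame = {
    import spark.implicits._

    // Toy training data standing in for the asker's `training` DataFrame.
    val training = Seq(
      (0.0, 1.0, 0.1),
      (1.0, 0.2, 2.3),
      (0.0, 0.9, 0.3),
      (1.0, 0.1, 1.9)
    ).toDF("label", "x1", "x2")

    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(assembler, lr))

    // The expensive step: run it once, then persist the result.
    val model: PipelineModel = pipeline.fit(training)
    model.write.overwrite().save(modelPath)

    // Later (possibly in another job): reload and predict without refitting.
    val sameModel = PipelineModel.load(modelPath)
    sameModel.transform(training)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("save-load-pipeline-sketch")
      .master("local[*]")
      .getOrCreate()

    // Local temp directory for the sketch; swap in an HDFS path on a cluster.
    val modelPath = Files.createTempDirectory("pipeline-model").toString
    roundTrip(spark, modelPath).select("label", "prediction").show()

    spark.stop()
  }
}
```

The loaded `PipelineModel` carries every fitted stage (indexers, encoders, assembler, and the classifier), so `transform` can be applied to raw input with the same columns that were used for training.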
