
I wrote this code in Spark ML:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.Pipeline

val lr = new LogisticRegression()
val pipeline = new Pipeline()
                .setStages(Array(fooIndexer, fooHotEncoder, assembler, lr))
val model = pipeline.fit(training)

This code takes a long time to run. After calling pipeline.fit, can I save the fitted model to HDFS so that I don't have to refit it every time?

Edit: Also, how do I load it back from HDFS when I need to call transform on the model to make predictions?

Knows Not Much

1 Answer


Straight from the official documentation - saving:

// Now we can optionally save the fitted pipeline to disk
model.write.overwrite().save("/tmp/spark-logistic-regression-model")

and loading:

// And load it back in during production
import org.apache.spark.ml.PipelineModel
val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")
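
To tie this back to the HDFS and transform parts of the question, here is a minimal sketch rather than a definitive implementation. The object name ModelStore, the helper names, and the example path are all hypothetical. It assumes the fitted model is a PipelineModel (which is what pipeline.fit returns) and that the DataFrame being scored has the same input columns the pipeline was trained on.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame

// Hypothetical helper object; the names here are illustrative only.
object ModelStore {

  // Assumed example location. The "hdfs://" scheme makes the target explicit;
  // a path without a scheme is resolved against the cluster's default file system.
  val examplePath = "hdfs:///tmp/spark-logistic-regression-model"

  // Persist a fitted pipeline, replacing any model already stored at `path`.
  def save(model: PipelineModel, path: String): Unit =
    model.write.overwrite().save(path)

  // Reload the fitted pipeline and apply it to new data.
  // The result contains the input columns plus the columns added by each stage.
  def loadAndPredict(path: String, data: DataFrame): DataFrame =
    PipelineModel.load(path).transform(data)
}
```

With the question's variables this would be called as ModelStore.save(model, ModelStore.examplePath) once after fitting, and ModelStore.loadAndPredict(ModelStore.examplePath, newData) in later runs, where newData is a placeholder for whatever DataFrame you want predictions for. Loading restores the already fitted stages, so the expensive pipeline.fit call is not repeated.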
