15

I am trying to save thousands of models produced by an ML Pipeline. As indicated in the answer here, the models can be saved as follows:

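import java.io._
import org.apache.spark.ml.PipelineModel

// Serialize one fitted pipeline to a file on the local filesystem.
def saveModel(name: String, model: PipelineModel): Unit = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  try oos.writeObject(model) finally oos.close()
}

// Save each school's model under the school's name.
schools.zip(bySchoolArrayModels).foreach {
  case (name, model) => saveModel(name, model)
}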

I have tried using s3://some/path/$name and /user/hadoop/some/path/$name, as I would like the models to be saved to Amazon S3 eventually, but both fail with messages indicating that the path cannot be found.

How can I save the models to Amazon S3?

SH Y.

3 Answers

10

One way to save a model to HDFS is as follows:

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/linReg.model")

The saved model can then be loaded from the same path:

val linRegModel = sc.objectFile[LinearRegressionModel]("hdfs:///user/root/linReg.model").first()

For more details see (ref)
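As a rough sketch of how this approach might be wrapped for many models, the helpers below take the target path as a parameter. The names saveObject and loadObject are hypothetical and not part of Spark. The path can be any URI that the cluster's Hadoop configuration can resolve, which is an assumption about your setup. Note that saveAsObjectFile fails if the target path already exists.

```scala
import scala.reflect.ClassTag
import org.apache.spark.SparkContext

// Hypothetical helper: write a single serializable object as a one-partition object file.
def saveObject[T: ClassTag](sc: SparkContext, path: String, obj: T): Unit =
  sc.parallelize(Seq(obj), 1).saveAsObjectFile(path)

// Hypothetical helper: read the object back from the same path.
def loadObject[T: ClassTag](sc: SparkContext, path: String): T =
  sc.objectFile[T](path).first()
```

With these, each model in the question could be written to its own path, for example by appending the school name to a common base URI.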

Neil
4

Since Apache Spark 1.6, in the Scala API, you can save your models without any tricks, because many models from the ML library come with a save method. You can check this in LogisticRegressionModel, which does have that method. To load the model, use the static load method:

val logRegModel = LogisticRegressionModel.load("myModel.model")
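A minimal sketch of the same idea applied to the PipelineModel from the question, assuming every stage in the pipeline supports ML persistence in your Spark version (not all did in 1.6). The helper names and the basePath parameter are hypothetical. basePath is assumed to be a location Spark can reach through the Hadoop filesystem layer, such as an hdfs:// or s3a:// URI with the matching connector and credentials configured.

```scala
import org.apache.spark.ml.PipelineModel

// Hypothetical helper: write one fitted pipeline under basePath/name,
// replacing any copy saved there earlier.
def saveModel(basePath: String, name: String, model: PipelineModel): Unit =
  model.write.overwrite().save(s"$basePath/$name")

// Hypothetical helper: read the pipeline back from the same location.
def loadModel(basePath: String, name: String): PipelineModel =
  PipelineModel.load(s"$basePath/$name")
```

Each saved model is a directory of metadata and data files rather than a single file, so every model needs its own path.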
Alberto Bonsanto
  • Hi @Alberto, looking at the API, there is no load method? Also, .save isn't available for other algorithms such as Random Forest. There doesn't seem to be a straightforward way to save models in ML. – other15 Jun 16 '16 at 10:55
  • Many of the ML models implement such methods; others don't. I think the Spark 2.0 version will fix this. – Alberto Bonsanto Jun 16 '16 at 12:12
  • Hopefully. It's strange that it has taken so long for this to be implemented. Another thing: I see some models, such as LogisticRegressionModel, have a save method but no load method. How would you load your saved model? – other15 Jun 16 '16 at 15:26
1

FileOutputStream writes to the local filesystem (not through the Hadoop libraries), so with this code, saving to a local directory is the way to go. The directory needs to exist, so make sure you create it first.
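A minimal sketch of that advice, using only the JDK: create the directory, then serialize into it. The helper names saveLocal and loadLocal are hypothetical, and the sketch assumes the object being saved is java.io.Serializable, as the question's code already assumes for its models.

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper: serialize obj to dir/name on the local filesystem.
// createDirectories creates any missing parents and does nothing if dir already exists.
def saveLocal(dir: String, name: String, obj: java.io.Serializable): Path = {
  val target = Files.createDirectories(Paths.get(dir)).resolve(name)
  val oos = new ObjectOutputStream(new FileOutputStream(target.toFile))
  try oos.writeObject(obj) finally oos.close()
  target
}

// Hypothetical helper: read the object back; the caller casts it to the expected type.
def loadLocal(path: Path): AnyRef = {
  val ois = new ObjectInputStream(new FileInputStream(path.toFile))
  try ois.readObject() finally ois.close()
}
```

Files written this way live on the driver's local disk, so getting them to S3 would still need a separate upload step.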

Depending on your model, you may also wish to look at https://spark.apache.org/docs/latest/mllib-pmml-model-export.html (PMML export).

Holden