How to save models from ML Pipeline to S3 or HDFS?

Question

I am trying to save thousands of models produced by ML Pipeline. As indicated in the answer here, the models can be saved as follows:

import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close
}

schools.zip(bySchoolArrayModels).foreach{
  case (name, model) => saveModel(name, model)
}

I have tried using s3://some/path/$name and /user/hadoop/some/path/$name, as I would like the models to be saved to Amazon S3 eventually, but both fail with messages indicating that the path cannot be found.

How to save models to Amazon S3?

score 10 · Answer 1 · answered Sep 19 '15 at 04:12

One way to save a model to HDFS is as follows:

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/linReg.model")

The saved model can then be loaded from the same path:

val linRegModel = sc.objectFile[LinearRegressionModel]("hdfs:///user/root/linReg.model").first()

For more details see (ref)

answered Sep 19 '15 at 04:12

Neil

It works, but when the model is reloaded from HDFS, some information is lost, such as parent, etc. – whb_zju Dec 03 '15 at 03:21

Alberto Bonsanto · Answer 2 · 2016-04-19T15:08:09.993

4

Since Apache Spark 1.6, the Scala API lets you save models without any tricks. Many models in the ML library come with a save method; you can check this in LogisticRegressionModel, which does have it. To load the model, use the static load method:

val logRegModel = LogisticRegressionModel.load("myModel.model")
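
For the PipelineModel in the question, the same persistence API might be used roughly as follows. This is a minimal sketch, assuming Spark 1.6 or later and that every stage in the pipeline supports saving (in 1.6 not all of them do). The names saveModel, loadModel and basePath are hypothetical, and an hdfs:// or s3a:// base path only works if the cluster has the matching Hadoop filesystem connector and credentials configured.

```scala
import org.apache.spark.ml.PipelineModel

// Hypothetical helper: writes one model under basePath/name.
// basePath could be a local path or, assuming the connector is set up,
// something like "hdfs:///user/hadoop/models" or "s3a://some-bucket/models".
def saveModel(basePath: String, name: String, model: PipelineModel): Unit =
  model.write.overwrite().save(s"$basePath/$name") // overwrite() replaces an existing directory

// Hypothetical helper: reads the model back from the same location.
def loadModel(basePath: String, name: String): PipelineModel =
  PipelineModel.load(s"$basePath/$name")
```

With these helpers, the loop from the question would become schools.zip(bySchoolArrayModels).foreach { case (name, model) => saveModel(basePath, name, model) }. Each model is written as a directory of metadata and data files rather than a single file.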

edited Apr 19 '16 at 15:08

answered Feb 01 '16 at 19:17

Alberto Bonsanto

Hi @Alberto, looking at the API, there is no load method? Also, .save isn't available for other algorithms such as Random Forest. There doesn't seem to be a straightforward way to save models in ML. – other15 Jun 16 '16 at 10:55
Many of the ML models implement such methods; others don't. I think Spark 2.0 will fix this. – Alberto Bonsanto Jun 16 '16 at 12:12
Hopefully. It's strange that it has taken so long for this to be implemented. Another thing: I see some models, such as LogisticRegressionModel, have a save method but no load method. How would you load your saved model? – other15 Jun 16 '16 at 15:26

score 1 · Answer 3 · answered Aug 30 '15 at 06:52

FileOutputStream writes to the local filesystem (not through the Hadoop libraries), so with this approach you can only save to a local directory; s3:// and HDFS paths are not understood. The directory also needs to exist, so make sure it has been created first.
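
A minimal sketch of the local-directory approach, creating the target directory before writing. The helper name saveLocally and its parameters are hypothetical; it uses only the Java standard library and works for any Serializable object, not just models.

```scala
import java.io.{FileOutputStream, ObjectOutputStream}
import java.nio.file.{Files, Paths}

// Hypothetical helper: Java-serializes obj to dir/name on the LOCAL filesystem.
def saveLocally(dir: String, name: String, obj: java.io.Serializable): Unit = {
  // Create the directory (and any missing parents); does nothing if it already exists.
  Files.createDirectories(Paths.get(dir))
  val oos = new ObjectOutputStream(new FileOutputStream(Paths.get(dir, name).toFile))
  try oos.writeObject(obj)
  finally oos.close() // always release the file handle, even if serialization fails
}
```

The resulting files still live on the driver's local disk, so they would have to be copied to S3 or HDFS in a separate step.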

Depending on your model, you may also wish to look at https://spark.apache.org/docs/latest/mllib-pmml-model-export.html (PMML export).
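
A minimal sketch of such an export, assuming the model is an RDD-based spark.mllib model that mixes in PMMLExportable (KMeansModel is one example). A spark.ml PipelineModel does not have this method, so this only applies if the models can be trained with the older API. The helper name exportPmml is hypothetical.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.pmml.PMMLExportable

// Hypothetical helper: writes the model as PMML to a directory on a
// Hadoop-compatible filesystem (the path is resolved through the Hadoop libraries).
def exportPmml(sc: SparkContext, model: PMMLExportable, path: String): Unit =
  model.toPMML(sc, path)
```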
