Saving Spark ML pipeline to a database

Question

Is it possible to save a Spark ML pipeline to a database (Cassandra for example)? From the documentation I can only see the save to path option:

myMLWritable.save(toPath);

Is there a way to somehow wrap or change the myMLWritable.write() MLWriter instance and redirect the output to the database?

score 0 · Accepted Answer · edited Nov 23 '17 at 15:42


It is not possible (or at least not supported) at the moment. MLWriter is not extensible and depends on Parquet files and a directory structure to represent models.

Technically speaking, you can extract the individual components and use the internal private API to recreate models from scratch, but that is likely the only option.
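Since a saved model is just a directory tree, one workaround that stays outside the writer itself is to save to a temporary path first and then store the packed directory in the database as a single blob. The sketch below is my own illustration, not something the answer proposes. It uses only the Python standard library: sqlite3 stands in for Cassandra, and the helper names (pack_dir, unpack_dir, store_model, fetch_model) and the models table are hypothetical.

```python
import io
import sqlite3
import zipfile
from pathlib import Path


def pack_dir(model_dir):
    """Zip every file under model_dir into an in-memory archive (empty dirs are skipped)."""
    root = Path(model_dir)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                # Store paths relative to the model root so the tree can be restored anywhere.
                zf.write(path, path.relative_to(root).as_posix())
    return buf.getvalue()


def unpack_dir(blob, target_dir):
    """Restore the directory tree so a path-based load can read it again."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        zf.extractall(target_dir)


def store_model(conn, name, model_dir):
    """Write the packed model directory as one blob row (hypothetical 'models' table)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS models (name TEXT PRIMARY KEY, archive BLOB)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO models (name, archive) VALUES (?, ?)",
        (name, pack_dir(model_dir)),
    )
    conn.commit()


def fetch_model(conn, name, target_dir):
    """Read the blob back and unpack it into target_dir; raises KeyError if missing."""
    row = conn.execute(
        "SELECT archive FROM models WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    unpack_dir(row[0], target_dir)
```

The idea would be to call myMLWritable.save(tmpPath) on a local path, pass that path to store_model, and later call fetch_model into a fresh directory before loading from it. This only makes sense for models small enough to fit comfortably in a single database value.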

edited Nov 23 '17 at 15:42

Indrajit Swain


answered Nov 23 '17 at 11:48

user8996166


score 0 · Answer 2 · answered Nov 23 '17 at 15:13

Spark 2.0.0+

At first glance, all Transformers and Estimators implement MLWritable. If you use Spark <= 1.6.0 and experience issues with model saving, I would suggest switching to a newer version.

Spark >= 1.6

Since Spark 1.6 it's possible to save your models using the save method, because almost every model implements the MLWritable interface. LogisticRegressionModel, for example, implements it, so you can save that model to the desired path with save.

Spark < 1.6

Some operations on DataFrames can be optimized, which translates to improved performance compared to plain RDDs. DataFrames provide efficient caching, and the SQL-like API is arguably easier to comprehend than the RDD API.

ML Pipelines are extremely useful, and tools like the cross-validator or the different evaluators are simply must-haves in any machine learning pipeline. Even if none of the above is particularly hard to implement on top of the low-level MLlib API, it is much better to have a ready-to-use, universal and relatively well-tested solution.

I believe that at the end of the day what you get by using ML over MLlib is a quite elegant, high-level API. One thing you can do is combine both to create a custom multi-step pipeline:

use ML to load, clean and transform data,
extract the required data (see for example the extractLabeledPoints method) and pass it to an MLlib algorithm,
add custom cross-validation / evaluation,
save the MLlib model using a method of your choice (Spark model or PMML).

There is also a temporary solution provided in Jira ("Temporary Solution").

No, I don't think this would allow it, because there too you need to save the model to a distributed file system. I don't think it will work with Cassandra itself. I have solved a similar problem: we saved the metadata (i.e. the URL) in Cassandra and saved the model itself in HDFS. I hope that helps! - Shivansh, Nov 23 '17 at 15:45
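The approach in this comment (model files on HDFS, only the location in the database) can be sketched roughly as below. This is a minimal illustration under assumptions, not the commenter's actual code: sqlite3 stands in for Cassandra so the snippet runs with the standard library alone, and the model_registry table, the version column, and the function names are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def init_registry(conn):
    """Create the (hypothetical) table that maps a model name to its storage URI."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS model_registry ("
        "name TEXT, version INTEGER, uri TEXT, saved_at TEXT, "
        "PRIMARY KEY (name, version))"
    )
    conn.commit()


def register_model(conn, name, uri):
    """Record where a saved model lives; the model files themselves stay on HDFS."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM model_registry WHERE name = ?",
        (name,),
    ).fetchone()
    version = latest + 1
    conn.execute(
        "INSERT INTO model_registry (name, version, uri, saved_at) VALUES (?, ?, ?, ?)",
        (name, version, uri, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return version


def latest_uri(conn, name):
    """Return the URI of the newest registered version, or None if there is none."""
    row = conn.execute(
        "SELECT uri FROM model_registry WHERE name = ? ORDER BY version DESC LIMIT 1",
        (name,),
    ).fetchone()
    return row[0] if row else None
```

In use, the pipeline would first be saved to a path such as an hdfs:// URI with the regular save method, then register_model would be called with that same URI. A consumer would look the location up with latest_uri and load the model from that path.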
