
I have a PySpark job that processes input data and trains a logistic regression model. I need to somehow transfer this trained model to production code written in Java Spark. After loading the trained model, the Java code will pass features to it to get predictions.

On the PySpark side, I'm using the DataFrame API (spark.ml), not mllib.

Is it possible to save the trained (fitted) model to a file and read it back from the Java Spark code? If there's a better way, please let me know.

Brian C

2 Answers


Yes, it is possible. With the single exception of SparkR, which requires additional metadata for model loading, all native ML models (custom guest-language extensions notwithstanding) can be saved with one backend and loaded with another.

Just save the MLWritable object on one side, using its save method or its writer (write), and load it back with the compatible MLReadable on the other side. Let's say in Python:

from pyspark.ml.feature import StringIndexer

StringIndexer(inputCol="foo", outputCol="bar").write().save("/tmp/indexer")

and in Scala:

import org.apache.spark.ml.feature.StringIndexer

val indexer = StringIndexer.load("/tmp/indexer")
indexer.getInputCol
// String = foo
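The same round trip works for the fitted model in the question: save the trained LogisticRegressionModel from PySpark, then load it from Java with the same class. A minimal Java sketch, where the path and app name are placeholders and `featureDf` stands in for whatever DataFrame holds your features:

```java
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.SparkSession;

public class LoadLrModel {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("load-lr-model")  // placeholder app name
                .getOrCreate();

        // Same path the PySpark side wrote with model.save("/tmp/lr-model")
        LogisticRegressionModel model = LogisticRegressionModel.load("/tmp/lr-model");

        // featureDf must contain the same features column the model was trained on:
        // Dataset<Row> predictions = model.transform(featureDf);

        spark.stop();
    }
}
```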

That being said, Spark ML models are typically a poor choice for production serving, and more suitable options exist: see How to serve a Spark MLlib model?.

  • Thanks for your reply. Can I ask what you mean by "ML models are typically bad choices for production use"? I'm trying to do all the heavy-load training in a "test pipeline" and save the model to a file. From the production pipeline, I'll simply load this pipeline, iterate through each row of a dataframe, and predict the outcome based on the row's features. I was under the impression that prediction generally takes very little time. – Brian C Mar 10 '19 at 01:40
  • @BrianC The main problem with ML Pipelines is lack of support for local predict. It means you need at least `SparkSession` and `DataFrame` for predict - that's a huge overhead for let's say, single prediction, and has direct impact on prediction latency. – 10465355 Mar 10 '19 at 11:36

Welcome to SO. Have you tried doing this? In general, it should work: if you save a spark.ml model, you can load it with Spark from any language that Spark supports. Alternatively, logistic regression is a simple model, so you can just save its weights as an array and recreate the prediction in your own code.
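Following that second suggestion, here is a minimal sketch of recreating a binary logistic regression prediction from exported weights. The coefficient and intercept values below are made up; in practice they would come from `model.coefficients` and `model.intercept` on the fitted PySpark model.

```java
public class LrPredict {

    /** Probability of the positive class for a binary logistic regression model. */
    static double predictProbability(double[] features, double[] coefficients, double intercept) {
        double margin = intercept;
        for (int i = 0; i < features.length; i++) {
            margin += coefficients[i] * features[i];  // dot product of weights and features
        }
        return 1.0 / (1.0 + Math.exp(-margin));  // logistic (sigmoid) function
    }

    public static void main(String[] args) {
        // Made-up weights standing in for model.coefficients / model.intercept
        double[] coefficients = {0.8, -1.2};
        double intercept = 0.1;

        double p = predictProbability(new double[]{1.0, 0.5}, coefficients, intercept);
        System.out.println(p >= 0.5 ? 1 : 0);  // label with the default 0.5 threshold
    }
}
```

Note this skips the featurization steps (indexing, scaling, vector assembly) that a full pipeline would apply before the model, so it only works if the production code reproduces those transformations itself.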

Mikhail Berlinkov