14

I am interested in deploying a machine learning model in Python, so that predictions can be made through requests to a server.

I will create a Cloudera cluster and take advantage of Spark to develop the models, using the pyspark library. I would like to know how the model can be saved so that it can be used on the server.

I have seen that the different algorithms have .save functions (as answered in this post: How to save and load MLLib model in Apache Spark), but since the server will be on a different machine without Spark, outside the Cloudera cluster, I don't know if it is possible to use their .load and .predict functions there.

Can this be done using the pyspark library functions for prediction without Spark underneath? Or would I have to do some transformation in order to save the model and use it elsewhere?

Community

3 Answers

4

After spending an hour I got this working code. It may not be optimized:

Mymodel.py:

import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="E:\\Work\\spark\\installtion\\spark"

# Append pyspark  to Python Path
sys.path.append("E:\\Work\\spark\\installtion\\spark\\python")

try:
    from pyspark.ml.feature import StringIndexer
    # $example on$
    from numpy import array
    from math import sqrt
    from pyspark import SparkConf
    # $example off$

    from pyspark import SparkContext
    # $example on$
    from pyspark.mllib.clustering import KMeans, KMeansModel

    print ("Successfully imported Spark Modules")

except ImportError as e:
    sys.exit(1)


if __name__ == "__main__":
    sconf = SparkConf().setAppName("KMeansExample").set('spark.sql.warehouse.dir', 'file:///E:/Work/spark/installtion/spark/spark-warehouse/')
    sc = SparkContext(conf=sconf)  # SparkContext
    parsedData =  array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4,2)
    clusters = KMeans.train(sc.parallelize(parsedData), 2, maxIterations=10,
                            runs=10, initializationMode="random")
    clusters.save(sc, "mymodel")  # save the model to the file system
    sc.stop()

This code will create a k-means cluster model and save it to the file system.
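If you want to confirm that the model was written correctly, a minimal sanity check (a sketch; run it just before sc.stop() in Mymodel.py) is to reload the saved directory and score a point:

# Sanity check (sketch): reload the model that clusters.save() wrote to the
# "mymodel" directory and score a single point with it.
sameModel = KMeansModel.load(sc, "mymodel")
print(sameModel.predict(array([0.0, 0.0])))  # prints the cluster index, e.g. 0 or 1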

API.py

from flask import jsonify, request, Flask
from sklearn.externals import joblib
import os
import sys

# Path for spark source folder
os.environ['SPARK_HOME']="E:\\Work\\spark\\installtion\\spark"

# Append pyspark  to Python Path
sys.path.append("E:\\Work\\spark\\installtion\\spark\\python")

try:
    from pyspark.ml.feature import StringIndexer
    # $example on$
    from numpy import array
    from math import sqrt
    from pyspark import SparkConf
    # $example off$

    from pyspark import SparkContext
    # $example on$
    from pyspark.mllib.clustering import KMeans, KMeansModel

    print ("Successfully imported Spark Modules")

except ImportError as e:
    sys.exit(1)


app = Flask(__name__)

# Create the SparkContext and load the model once at startup, not inside the
# request handler: only one SparkContext can be active per process.
sconf = SparkConf().setAppName("KMeansExample").set('spark.sql.warehouse.dir', 'file:///E:/Work/spark/installtion/spark/spark-warehouse/')
sc = SparkContext(conf=sconf)  # SparkContext
sameModel = KMeansModel.load(sc, "mymodel")  # load the model saved by Mymodel.py

@app.route('/', methods=['GET'])
def predict():
    result = sameModel.predict(array([0.0, 0.0]))  # pass your data here
    return jsonify(cluster=int(result))  # cast to a plain int so it is JSON serializable

if __name__ == '__main__':
    app.run()

Above is my REST API written in Flask.

Make a GET request to http://127.0.0.1:5000/. You can see the response in the browser.
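If you prefer to call it from a script instead of the browser, here is a minimal client sketch using the requests library (assuming the Flask dev server is running on its default port 5000):

import requests

# Call the prediction endpoint exposed by API.py.
resp = requests.get("http://127.0.0.1:5000/")
print(resp.json())  # e.g. {"cluster": 0}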

backtrack
  • I'd like to know in which format the model is saved when clusters.save is invoked. Thanks in advance. – daloman Nov 21 '16 at 11:29
  • 1
    Hi, thank you for your answer. But there is one thing I am not sure about. Could I run the API.py script on a machine where only Python is installed? Or do I need to have Spark installed as well? In that case, is it enough to install the stand-alone version? – Marcial Gonzalez Nov 24 '16 at 11:12
  • 1
  • @MarcialGonzalez, Yes, we have to install Spark on the server, or you can do another thing: set up port-based communication between your REST server and your Spark ML server. – backtrack Nov 24 '16 at 12:21
  • @MarcialGonzalez, In my production setup we have a REST API exposed to the client and our ML server running; the REST API communicates with the ML server over a port and returns the response. – backtrack Nov 24 '16 at 12:24
  • 1
    And is it possible to persist a Spark ML model generated with pyspark by using, for example, pickle or joblib? The idea is to export it and load it into a machine where only Python is installed. – Marcial Gonzalez Nov 24 '16 at 17:20
  • Note that this is not really what the OP is asking, as s/he was asking about serving a SparkMLLib model without Spark. The short answer is that it is not possible. – MrE May 19 '22 at 18:01
2

Take a look at MLeap (a project I contribute to) - it provides serialization/de-serialization of entire ML pipelines (not just the estimator) and an execution engine that doesn't rely on the Spark context, distributed data frames, or execution plans.

As of today, MLeap's runtime for executing models doesn't have Python bindings, only Scala/Java, but it shouldn't be complicated to add them. Feel free to reach out on GitHub to me and the other MLeap developers if you need help creating a scoring engine from your Spark-trained pipelines and models.
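For the export side, here is a rough sketch of serializing a fitted PySpark pipeline to an MLeap bundle, based on the mleap-pyspark integration (the import path, the jar:file URI scheme, and the trivial pipeline are assumptions for illustration; check the MLeap documentation for your version):

# Sketch (assumes the `mleap` Python package is installed alongside pyspark).
# Trains a trivial pipeline and exports it as an MLeap bundle (a zip on disk).
import mleap.pyspark  # importing registers serializeToBundle on Spark models
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("mleap-export").getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["category"])

pipeline = Pipeline(stages=[StringIndexer(inputCol="category", outputCol="category_index")])
fitted_pipeline = pipeline.fit(df)

# Write the bundle; MLeap's runtime (Java/Scala) can later load and score it without Spark.
fitted_pipeline.serializeToBundle("jar:file:/tmp/pyspark.example.zip",
                                  fitted_pipeline.transform(df))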

msemeniuk
  • I just created the https://stackoverflow.com/questions/tagged/mleap tag, you might want to follow it. Is there a plan to integrate mleap into the main Spark project/branch? How about support for Java 8? – Marsellus Wallace Jan 05 '18 at 19:01
-1

This may not be the complete solution.

Model.py

from sklearn.externals import joblib
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

# code to load training data into X_train, y_train, split train/test set

vec = HashingVectorizer()
svc = LinearSVC()
clf = make_pipeline(vec, svc)
clf.fit(X_train, y_train)  # fit the whole pipeline (vectorizer + SVC), not just the SVC

joblib.dump({'class1': clf}, 'models', compress=9)

myRest.py

from flask import jsonify, request, Flask
from sklearn.externals import joblib

models = joblib.load('models')
app = Flask(__name__)

@app.route('/', methods=['POST'])
def predict():
    text = request.form.get('text')
    results = {}
    for name, clf in models.items():  # iteritems() is Python 2 only
        results[name] = clf.predict([text])[0]
    return jsonify(results)

if __name__ == '__main__':
    app.run()

You can do something like this. Ref: https://loads.pickle.me.uk/2016/04/04/deploying-a-scikit-learn-classifier-to-production/

For Spark: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
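To try the scikit-learn endpoint above, here is a minimal client sketch (assuming the Flask dev server on port 5000 and the form field name text used in myRest.py; the sample text and label are placeholders):

import requests

# POST a text sample to the endpoint defined in myRest.py.
resp = requests.post("http://127.0.0.1:5000/", data={"text": "some example text"})
print(resp.json())  # e.g. {"class1": "<predicted label>"}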

backtrack
  • I am afraid it is not a solution at all. PySpark `ml` is not `scikit-learn`. –  Nov 21 '16 at 09:03
  • 1
    @LostInOverflow, I know; I added an example for scikit-learn, and indeed I accept your comment. But we can also load a Spark ML model like this: sameModel = MatrixFactorizationModel.load(sc, "target/tmp/myCollaborativeFilter"). Check this link: http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html – backtrack Nov 21 '16 at 09:34
  • You can, but it requires at least a local-mode "cluster". So it is not a non-Spark environment. –  Nov 21 '16 at 11:35
  • @LostInOverflow yes, and check my other answer with a working sample – backtrack Nov 21 '16 at 11:55