I'm writing a PySpark application and would like to use the Linear Regression algorithm from Spark ML. But I can't figure out a way to save/load the fitted model. My code:

import os
import sys

os.environ['SPARK_HOME'] = r"C:\spark-2.2.0-bin-hadoop2.7"  # raw string avoids backslash-escape issues on Windows
try:
    from pyspark.sql import SparkSession
    from pyspark.ml.regression import LinearRegression
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.feature import VectorAssembler
except ImportError as e:
    print ("Error importing Spark Modules", e)
    sys.exit(1)

spark=SparkSession.builder.appName("lrexample").getOrCreate()
data=spark.read.csv("E:/Customers.csv", inferSchema=True, header=True)

assembler=VectorAssembler(inputCols=['Avg Session Length','Time on App','Time on Website','Length of Membership'],outputCol='features')
output=assembler.transform(data)
final_data=output.select('features','Yearly Amount Spent')
train_data,test_data=final_data.randomSplit([0.7,0.3])

lr=LinearRegression(labelCol='Yearly Amount Spent')
lr_model=lr.fit(train_data)

My question is: how do I save/load lr_model? I will use HBase.

Talha K.

1 Answer

I can't speak to saving directly into HBase, but most Spark ML models expose methods for saving to and loading from disk (local or HDFS); in your case:

# saving:
lr_model.save(model_path)

# loading:
# note: load via the *Model* class, not the LinearRegression estimator
from pyspark.ml.regression import LinearRegressionModel
model = LinearRegressionModel.load(model_path)

See the documentation and this thread for more.

desertnaut