
I have produced an IDFModel with PySpark and an IPython notebook as follows:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

hashingTF = HashingTF()   #this will be used with hashing later

txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey() #this returns RDD of (filename, string) pairs for each file from the directory

split_data_train = txtdata_train.map(parse) #my parse function puts RDD in form I want

tf_train = hashingTF.transform(split_data_train) #creates term frequency sparse vectors for the training set

tf_train.cache()

idf_train = IDF().fit(tf_train)    #makes IDFmodel, THIS IS WHAT I WANT TO SAVE!!!

tfidf_train = idf_train.transform(tf_train)

This is based on this guide https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I would like to save this model so that I can load it again at a later time within a different notebook. However, there is no information on how to do this; the closest I can find is:

Save Apache Spark mllib model in python

But when I tried the suggestion in the answer

idf_train.save(sc, "/home/ubuntu/newfolder")

I get the error code

AttributeError: 'IDFModel' object has no attribute 'save'

Is there something I am missing, or is it simply not possible to save IDFModel objects? Thanks!

Matt
  • I am using Spark 1.2.0 built for Hadoop 2.4.0 – Matt Aug 31 '15 at 18:34
  • Take a look to the [docs](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html). `IDFModel` does not have a `save` method, while the model in the other SO question, `RandomForestModel`, does have it... – lrnzcig Aug 31 '15 at 20:26
  • You're right, thanks, it would be a worthwhile addition – Matt Aug 31 '15 at 21:45
  • any idea how to come up with my own method of saving it? – Matt Sep 01 '15 at 20:07
  • I don't think it would be easy... but sources are available. [Here](https://github.com/apache/spark/blob/master/python/pyspark/mllib/classification.py) at line 420 you've got an example of a model that can be saved. [Here](https://github.com/apache/spark/blob/master/python/pyspark/mllib/feature.py) at line 408 it's what you want to save. I'd bet `jmodel` is either a java model or can be converted to a java model. Too late in my timezone to give it a try. – lrnzcig Sep 01 '15 at 20:28
  • +Matt - I believe I gave you the answer you needed. Can you mark my comment as the solution ? – jarasss Jan 12 '16 at 01:02

1 Answer


I did something like that in Scala/Java. It seems to work, but might not be very efficient. The idea is to write the model to a file as a serialized object and read it back later. Good luck! :)

import java.io.{FileNotFoundException, FileOutputStream, IOException, ObjectOutputStream}

try {
  // write the fitted model (here `idf`) to disk as a plain Java-serialized object
  val fileOut: FileOutputStream = new FileOutputStream(savePath + "/idf.jserialized")
  val out: ObjectOutputStream = new ObjectOutputStream(fileOut)
  out.writeObject(idf)
  out.close()
  fileOut.close()
  System.out.println("\nSerialization successful... check your specified output file.\n")
} catch {
  case foe: FileNotFoundException => foe.printStackTrace()
  case ioe: IOException => ioe.printStackTrace()
}
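For completeness, the read-back side would look roughly like this; this is a minimal sketch that assumes the same `savePath` directory as above and that the serialized object was the Scala `IDFModel`:

import java.io.{FileInputStream, IOException, ObjectInputStream}
import org.apache.spark.mllib.feature.IDFModel

try {
  // read the serialized model back from the same path used when writing it
  val fileIn: FileInputStream = new FileInputStream(savePath + "/idf.jserialized")
  val in: ObjectInputStream = new ObjectInputStream(fileIn)
  val idf: IDFModel = in.readObject().asInstanceOf[IDFModel]
  in.close()
  fileIn.close()
  // idf.transform(...) can now be applied to new term-frequency vectors
} catch {
  case ioe: IOException => ioe.printStackTrace()
  case cnfe: ClassNotFoundException => cnfe.printStackTrace()
}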
jarasss