
I have produced an IDFModel with PySpark and an IPython notebook as follows:

from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

hashingTF = HashingTF()   #this will be used with hashing later

txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey() #this returns RDD of (filename, string) pairs for each file from the directory

split_data_train = txtdata_train.map(parse) #my parse function puts RDD in form I want

tf_train = hashingTF.transform(split_data_train) #creates term frequency sparse vectors for the training set

tf_train.cache()

idf_train = IDF().fit(tf_train)    #makes IDFmodel, THIS IS WHAT I WANT TO SAVE!!!

tfidf_train = idf_train.transform(tf_train)

This is based on this guide https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I would like to save this model so that I can load it again at a later time within a different notebook. However, there is no information on how to do this; the closest I can find is:

Save Apache Spark mllib model in python

But when I tried the suggestion in the answer

idf_train.save(sc, "/home/ubuntu/newfolder")

I get the error code

AttributeError: 'IDFModel' object has no attribute 'save'

Is there something I am missing, or is it simply not possible to save IDFModel objects? Thanks!

Matt
  • I am using Spark 1.2.0 built for Hadoop 2.4.0 – Matt Aug 31 '15 at 18:34
  • Take a look to the [docs](https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html). `IDFModel` does not have a `save` method, while the model in the other SO question, `RandomForestModel`, does have it... – lrnzcig Aug 31 '15 at 20:26
  • You're right, thanks, it would be a worthwhile addition – Matt Aug 31 '15 at 21:45
  • any idea how to come up with my own method of saving it? – Matt Sep 01 '15 at 20:07
  • I don't think it would be easy... but sources are available. [Here](https://github.com/apache/spark/blob/master/python/pyspark/mllib/classification.py) at line 420 you've got an example of a model that can be saved. [Here](https://github.com/apache/spark/blob/master/python/pyspark/mllib/feature.py) at line 408 it's what you want to save. I'd bet `jmodel` is either a java model or can be converted to a java model. Too late in my timezone to give it a try. – lrnzcig Sep 01 '15 at 20:28
  • +Matt - I believe I gave you the answer you needed. Can you mark my comment as the solution ? – jarasss Jan 12 '16 at 01:02

1 Answer


I did something like that in Scala/Java. It seems to work, but might not be very efficient. The idea is to write the model to a file as a serialized object and read it back later. Good luck! :)

import java.io.{FileNotFoundException, FileOutputStream, IOException, ObjectOutputStream}

try {
  // write the fitted model (here `idf`) to disk as a plain Java-serialized object
  val fileOut: FileOutputStream = new FileOutputStream(savePath + "/idf.jserialized")
  val out: ObjectOutputStream = new ObjectOutputStream(fileOut)
  out.writeObject(idf)
  out.close()
  fileOut.close()
  System.out.println("\nSerialization successful... check your specified output file.\n")
} catch {
  case foe: FileNotFoundException => foe.printStackTrace()
  case ioe: IOException => ioe.printStackTrace()
}
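For completeness, the read-back side would look roughly like this; this is a minimal sketch that assumes the same `savePath` directory as above and that the serialized object was the Scala `IDFModel`:

import java.io.{FileInputStream, IOException, ObjectInputStream}
import org.apache.spark.mllib.feature.IDFModel

try {
  // read the serialized model back from the same path used when writing it
  val fileIn: FileInputStream = new FileInputStream(savePath + "/idf.jserialized")
  val in: ObjectInputStream = new ObjectInputStream(fileIn)
  val idf: IDFModel = in.readObject().asInstanceOf[IDFModel]
  in.close()
  fileIn.close()
  // idf.transform(...) can now be applied to new term-frequency vectors
} catch {
  case ioe: IOException => ioe.printStackTrace()
  case cnfe: ClassNotFoundException => cnfe.printStackTrace()
}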
jarasss