I have built an IDFModel with PySpark in an IPython notebook as follows:
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
hashingTF = HashingTF() #this will be used with hashing later
txtdata_train = sc.wholeTextFiles("/home/ubuntu/folder").sortByKey() #this returns RDD of (filename, string) pairs for each file from the directory
split_data_train = txtdata_train.map(parse) #my parse function puts RDD in form I want
tf_train = hashingTF.transform(split_data_train) #creates term frequency sparse vectors for the training set
tf_train.cache()
idf_train = IDF().fit(tf_train) #makes IDFmodel, THIS IS WHAT I WANT TO SAVE!!!
tfidf_train = idf_train.transform(tf_train)
This is based on the guide at https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. I would like to save this model so that I can load it again later in a different notebook. However, I cannot find any information on how to do this; the closest I have found is:
Save Apache Spark mllib model in python
But when I try the suggestion from that answer:
idf_train.save(sc, "/home/ubuntu/newfolder")
I get the following error:
AttributeError: 'IDFModel' object has no attribute 'save'
Is there something I am missing, or is it not possible to save IDFModel objects? Thanks!
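To make the goal concrete, here is a minimal pure-Python sketch of the behaviour I am after. It is not MLlib code and does not use any PySpark API: the helper names (`fit_idf`, `transform`, `save_idf`, `load_idf`), the JSON file format, and the toy documents are all hypothetical. It only assumes the IDF formula from the MLlib guide, idf = log((m + 1) / (df + 1)), and shows fitted weights being written to disk and reloaded elsewhere.

```python
import json
import math
import os
import tempfile
from collections import Counter


def fit_idf(docs):
    """Return {term: idf} with idf = log((m + 1) / (df + 1)), m = number of docs."""
    m = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return {term: math.log(float(m + 1) / (count + 1)) for term, count in df.items()}


def transform(tokens, idf):
    """Scale raw term counts by the fitted IDF weights.

    Simplification (assumption): terms never seen during fitting get weight 0.
    """
    tf = Counter(tokens)
    return {term: freq * idf.get(term, 0.0) for term, freq in tf.items()}


def save_idf(idf, path):
    """Persist the fitted weights as JSON (hypothetical format)."""
    with open(path, "w") as f:
        json.dump(idf, f)


def load_idf(path):
    """Reload weights written by save_idf."""
    with open(path) as f:
        return json.load(f)


# Toy corpus standing in for the parsed training documents.
docs = [["spark", "idf", "model"], ["spark", "save"], ["spark", "load", "model"]]
idf = fit_idf(docs)

# Save in one place, load in another (e.g. a different notebook).
path = os.path.join(tempfile.mkdtemp(), "idf_weights.json")
save_idf(idf, path)
restored = load_idf(path)
tfidf = transform(["spark", "save"], restored)
```

What I would like is the equivalent of `save_idf` / `load_idf` for the real `IDFModel`, so that `idf_train` does not have to be refitted in every notebook.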