I am calculating TF and IDF using PySpark's HashingTF and IDF with the following code:
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

sc = SparkContext()

# Load documents (one per line) and split each line into tokens.
documents = sc.textFile("random.txt").map(lambda line: line.split(" "))

# Hash each document's tokens into a fixed-size term-frequency vector.
hashingTF = HashingTF()
tf = hashingTF.transform(documents)
tf.cache()

# Fit IDF on the TF vectors (ignoring terms seen in fewer than 2 documents)
# and rescale the TF vectors to TF-IDF.
idf = IDF(minDocFreq=2).fit(tf)
tfidf = idf.transform(tf)
The question is: I can print tfidf to the screen with the collect() method, but how can I access specific entries inside it, or save the whole TF-IDF vector space to an external file or a DataFrame?
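For reference, this is roughly the kind of thing I'm guessing at (the output directory, the SQLContext usage, and the "features" column name are just placeholders, and I don't know whether this is the idiomatic approach):

# Inspect a few vectors without collecting the whole RDD;
# each element of tfidf is a SparseVector of hashed-term weights.
for vector in tfidf.take(2):
    print(vector.indices, vector.values)

# Dump the whole RDD as text, one SparseVector repr per line
# ("tfidf_output" is just a placeholder directory name).
tfidf.saveAsTextFile("tfidf_output")

# Wrap each vector into a one-column DataFrame (guessing at the API here).
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(tfidf.map(lambda v: (v,)), ["features"])
df.show(2)

Is something along these lines the right way to do it, or is there a better approach?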