I am trying to calculate TF-IDF for documents of strings and I am referring http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf link.
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF
val sc: SparkContext = ...
// Load documents (one per line).
val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
output:
Array((1048576,[1088,35436,98482,1024805],[2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776]), (1048576,[49,34227,39165,114066,125344,240472,312955,388260,436506,469864,493361,496101,566174,747007,802226],[2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776,2.3513752571634776]),...
With this I am getting a RDD of vectors but I am not able to get any information from this vector about the original strings. Can anyone help me out in mapping indexes to original strings?