I am calculating TF-IDF using Spark with Python (PySpark's MLlib) with the following code:
from pyspark.mllib.feature import HashingTF, IDF

hashingTF = HashingTF()
tf = hashingTF.transform(documents)  # documents: RDD of token lists
idf = IDF().fit(tf)
tfidf = idf.transform(tf)
for k in tfidf.collect():
    print(k)
I get the following results for three documents:
(1048576,[558379],[1.43841036226])
(1048576,[181911,558379,959994], [0.287682072452,0.287682072452,0.287682072452])
(1048576,[181911,959994],[0.287682072452,0.287682072452])
Assuming I have thousands of documents, how can I link the resulting TF-IDF sparse vectors back to their original documents? (I don't care about reversing the hash keys to the original terms.)