
I am calculating TF-IDF with Spark's Python API using the following code:

    from pyspark.mllib.feature import HashingTF, IDF

    # documents is assumed to be an RDD of token lists,
    # e.g. sc.textFile("...").map(lambda line: line.split(" "))
    hashingTF = HashingTF()
    tf = hashingTF.transform(documents)   # hashed term frequencies per document
    idf = IDF().fit(tf)                   # fit inverse document frequencies on the corpus
    tfidf = idf.transform(tf)             # TF-IDF vectors, one per input document
    for k in tfidf.collect():
        print(k)

I got the following results for three documents:

    (1048576,[558379],[1.43841036226])
    (1048576,[181911,558379,959994],[0.287682072452,0.287682072452,0.287682072452])
    (1048576,[181911,959994],[0.287682072452,0.287682072452])

Assuming that I have thousands of documents, how can I link the resulting TF-IDF sparse vectors back to the original documents? (I don't care about reversing the hash keys to the original terms.)


1 Answer


Since both `documents` and `tfidf` have the same shape (number of partitions, number of elements per partition), and there are no operations that require a shuffle, you can simply zip the two RDDs:

    documents.zip(tfidf)

Reversing `HashingTF` is not possible, for an obvious reason: the hashing is not invertible, and distinct terms can collide into the same bucket.
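
A minimal sketch of how this pairs up, assuming `documents` was not repartitioned or otherwise shuffled between the transform calls (the name `docs_with_tfidf` is just illustrative):

    # Pair each original document with its TF-IDF vector.
    # zip() requires both RDDs to have the same number of partitions
    # and the same number of elements per partition.
    docs_with_tfidf = documents.zip(tfidf)
    for doc, vector in docs_with_tfidf.take(3):
        print(doc, vector)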

  • I can control the number of partitions, but how can I control the number of elements per partition? – K.Ali Mar 01 '16 at 07:40
  • You cannot. Well... You can apply different low-level transformations, but there is no way to do it directly. That is why `zip` is applicable only in some limited cases like this. Otherwise you need unique identifiers and a join. – zero323 Mar 01 '16 at 07:45
  • Ah, identifiers and a join, a very good point for me, and I have an identifier with my documents. But how can I enforce the code above to include it inside the tfidf RDD? – K.Ali Mar 01 '16 at 08:18 (a sketch of one way to do this follows below)
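
One possible way to carry an identifier through, sketched under the assumption that each document arrives as an `(id, tokens)` pair and that nothing reshuffles the RDDs between the transforms; the names `docs_with_id` and `tfidf_by_id` are illustrative:

    from pyspark.mllib.feature import HashingTF, IDF

    # Assumed input: an RDD of (doc_id, list_of_tokens) pairs, e.g.
    # docs_with_id = sc.parallelize([("d1", ["a", "b"]), ("d2", ["b", "c"])])
    ids = docs_with_id.keys()        # RDD of identifiers
    tokens = docs_with_id.values()   # RDD of token lists

    hashingTF = HashingTF()
    tf = hashingTF.transform(tokens)
    tfidf = IDF().fit(tf).transform(tf)

    # keys()/values()/transform() are all map-like (narrow) operations,
    # so per-partition order is preserved and the ids can be zipped back on.
    tfidf_by_id = ids.zip(tfidf)

    # With an identifier attached, the vectors can be joined with any
    # other RDD keyed by the same id.
    joined = docs_with_id.join(tfidf_by_id)   # (id, (tokens, tfidf_vector))

This keeps the MLlib calls unchanged; the identifier is only split off before the transforms and zipped back on afterwards.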