
I am calculating TF-IDF with Spark's Python API using the following code:

    from pyspark.mllib.feature import HashingTF, IDF

    # documents is assumed to be an RDD of token lists,
    # e.g. sc.textFile("...").map(lambda line: line.split(" "))
    hashingTF = HashingTF()
    tf = hashingTF.transform(documents)   # hashed term frequencies per document
    idf = IDF().fit(tf)                   # fit inverse document frequencies on the corpus
    tfidf = idf.transform(tf)             # TF-IDF vectors, one per input document
    for k in tfidf.collect():
        print(k)

I got the following results for three documents:

    (1048576,[558379],[1.43841036226])
    (1048576,[181911,558379,959994],[0.287682072452,0.287682072452,0.287682072452])
    (1048576,[181911,959994],[0.287682072452,0.287682072452])

Assuming that I have thousands of documents, how can I link the resulting TF-IDF sparse vectors back to the original documents? (I don't care about reversing the hash keys to the original terms.)


1 Answer


Since both `documents` and `tfidf` have the same shape (number of partitions, number of elements per partition), and there are no operations that require a shuffle, you can simply zip the two RDDs:

    documents.zip(tfidf)

Reversing `HashingTF` is not possible, for an obvious reason: the hashing is not invertible, and distinct terms can collide into the same bucket.
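
A minimal sketch of how this pairs up, assuming `documents` was not repartitioned or otherwise shuffled between the transform calls (the name `docs_with_tfidf` is just illustrative):

    # Pair each original document with its TF-IDF vector.
    # zip() requires both RDDs to have the same number of partitions
    # and the same number of elements per partition.
    docs_with_tfidf = documents.zip(tfidf)
    for doc, vector in docs_with_tfidf.take(3):
        print(doc, vector)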

  • I can control the number of partitions, but how can I control the number of elements per partition? – K.Ali Mar 01 '16 at 07:40
  • You cannot. Well... You can apply different low-level transformations, but there is no way to do it directly. That is why `zip` is applicable only in some limited cases like this. Otherwise you need unique identifiers and a join. – zero323 Mar 01 '16 at 07:45
  • Ah, identifiers and a join, a very good point for me, and I have an identifier with my documents. But how can I enforce the code above to include it inside the tfidf RDD? – K.Ali Mar 01 '16 at 08:18 (a sketch of one way to do this follows below)
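
One possible way to carry an identifier through, sketched under the assumption that each document arrives as an `(id, tokens)` pair and that nothing reshuffles the RDDs between the transforms; the names `docs_with_id` and `tfidf_by_id` are illustrative:

    from pyspark.mllib.feature import HashingTF, IDF

    # Assumed input: an RDD of (doc_id, list_of_tokens) pairs, e.g.
    # docs_with_id = sc.parallelize([("d1", ["a", "b"]), ("d2", ["b", "c"])])
    ids = docs_with_id.keys()        # RDD of identifiers
    tokens = docs_with_id.values()   # RDD of token lists

    hashingTF = HashingTF()
    tf = hashingTF.transform(tokens)
    tfidf = IDF().fit(tf).transform(tf)

    # keys()/values()/transform() are all map-like (narrow) operations,
    # so per-partition order is preserved and the ids can be zipped back on.
    tfidf_by_id = ids.zip(tfidf)

    # With an identifier attached, the vectors can be joined with any
    # other RDD keyed by the same id.
    joined = docs_with_id.join(tfidf_by_id)   # (id, (tokens, tfidf_vector))

This keeps the MLlib calls unchanged; the identifier is only split off before the transforms and zipped back on afterwards.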