I have a Spark DataFrame with two columns: `id` and `hash_vector`. The `id` is the id for a document and `hash_vector` is a `SparseVector` of word counts corresponding to the document (and has size 30000). There are ~100000 rows (one for each document) in the DataFrame.
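
For reference, here is a toy sketch of the structure I am working with (the ids, indices, and counts below are made up; the real vectors have size 30000):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import SparseVector

spark = SparkSession.builder.getOrCreate()

# Made-up example rows: each document id maps to a SparseVector of word counts
rows = [
    (0, SparseVector(30000, {3: 2.0, 17: 1.0, 256: 5.0})),
    (1, SparseVector(30000, {3: 1.0, 512: 4.0})),
]
df = spark.createDataFrame(rows, ["id", "hash_vector"])
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- hash_vector: vector (nullable = true)
```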
Now, I want to find similarities between every pair of documents. For this I want to compute cosine similarities from the column `hash_vector`. I may also want to try other similarity measures like the Jaccard index. What would be a good way of doing this? I am using PySpark. I have a few ideas:

- I can use `columnSimilarities()` to find the pairwise dot products. But I read that it is more efficient for a corpus with size_of_vocabulary >> number_of_documents (which is not the case here). A rough sketch of what I have in mind follows this list.
- I can loop through the DataFrame rows, and for the i-th row, add the i-th row as a column `new_column` to the DataFrame, then write a `udf` which finds the similarity (cosine or Jaccard) between the two columns `hash_vector` and `new_column`. But I read that looping through rows defeats the whole purpose of using Spark. A self-join version of this idea is also sketched below.
- Lastly, I store only similarities above a certain threshold. Since I have a lot of documents, I can expect the matrix of similarities to be quite sparse.
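
For the first idea, this is roughly what I have in mind (untested sketch; as far as I understand, `columnSimilarities()` compares the *columns* of a `RowMatrix`, so I transpose first so that documents become columns, and the threshold value is just a placeholder):

```python
from pyspark.mllib.linalg import Vectors as MLLibVectors
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Build a distributed matrix with one row per document
indexed_rows = df.rdd.map(
    lambda row: IndexedRow(row["id"], MLLibVectors.fromML(row["hash_vector"]))
)
mat = IndexedRowMatrix(indexed_rows)

# columnSimilarities() works on columns, so transpose: documents become columns
docs_as_cols = mat.toCoordinateMatrix().transpose().toRowMatrix()

# With a positive threshold, DIMSUM sampling is used and small similarities
# may be dropped; 0.1 is just a placeholder value
sims = docs_as_cols.columnSimilarities(threshold=0.1)

# sims is a CoordinateMatrix of (doc_i, doc_j, cosine_similarity) entries
sims_df = sims.entries.map(lambda e: (e.i, e.j, e.value)).toDF(
    ["doc_i", "doc_j", "cosine_sim"]
)
```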
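
For the second idea, I think the row-by-row loop could instead be written as a self-join plus a `udf`, something like the following (also an untested sketch; the `cosine_sim` function and the 0.5 threshold are mine, and the join is still quadratic in the number of documents, which is exactly my worry):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

def cosine_sim(v1, v2):
    """Cosine similarity between two SparseVectors (0.0 if either is all zeros)."""
    denom = v1.norm(2) * v2.norm(2)
    return float(v1.dot(v2) / denom) if denom != 0.0 else 0.0

cosine_udf = F.udf(cosine_sim, DoubleType())

# Non-equi self-join: Spark will likely plan this as a nested-loop/cartesian
# join, i.e. ~N^2 / 2 comparisons for N documents
pairs = (
    df.alias("a")
    .join(df.alias("b"), F.col("a.id") < F.col("b.id"))  # each unordered pair once
    .withColumn("sim", cosine_udf(F.col("a.hash_vector"), F.col("b.hash_vector")))
    .filter(F.col("sim") > 0.5)  # keep only similarities above a threshold
    .select(F.col("a.id").alias("id_a"), F.col("b.id").alias("id_b"), "sim")
)
```

A Jaccard version would just swap in a different `udf` over the two vectors' index sets.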
I know this is a broad question. But I am interested in knowing how an expert would think of going about this. I appreciate any direction.