
I have a Spark DataFrame with two columns, id and hash_vector. id is the id of a document and hash_vector is a SparseVector of word counts for that document (of size 30000). There are ~100000 rows in the DataFrame, one per document.
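For reference, here is a minimal sketch of the setup I am describing (toy data created by hand; the real hash_vector column has size 30000 and there are ~100000 rows):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import SparseVector

spark = SparkSession.builder.getOrCreate()

# toy version of the DataFrame: one row per document,
# hash_vector is a SparseVector of word counts over a 30000-word vocabulary
df_original = spark.createDataFrame(
    [
        (0, SparseVector(30000, {1: 2.0, 17: 1.0})),
        (1, SparseVector(30000, {1: 1.0, 42: 3.0})),
        (2, SparseVector(30000, {17: 4.0, 42: 1.0})),
    ],
    ["id", "hash_vector"],
)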

Now I want to find the similarity between every pair of documents. For this I want to compute cosine similarities on the hash_vector column. I may also want to try other similarity measures such as the Jaccard index. What would be a good way of doing this? I am using PySpark. I have a few ideas:

  1. I can use columnSimilarities (on a RowMatrix) to find the pairwise cosine similarities; a rough sketch of this idea follows the list. But I read that it is more efficient for a corpus with size_of_vocabulary >> number_of_documents (which is not the case here).
  2. I can loop through the DataFrame rows and, for the i-th row, add the i-th hash_vector as a column new_column to the DataFrame, then write a udf which computes the similarity (cosine or Jaccard) between the two columns hash_vector and new_column. But I read that looping through rows defeats the whole purpose of using Spark.
  3. Lastly, I would store only similarities above a certain threshold. Since I have a lot of documents, I expect the matrix of similarities to be quite sparse.
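
For idea 1, here is a rough sketch of what I have in mind (I am not sure this is the right or most efficient layout): columnSimilarities lives on RowMatrix and computes cosine similarities between columns, so the documents have to end up as columns of the matrix. Its optional threshold argument (DIMSUM sampling) might also cover idea 3:

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# explode each (id, hash_vector) row into (term_index, doc_id, count) entries,
# so that each document becomes a *column* of the matrix
entries = df_original.rdd.flatMap(
    lambda row: [
        MatrixEntry(int(i), row["id"], float(v))
        for i, v in zip(row["hash_vector"].indices, row["hash_vector"].values)
    ]
)

# cosine similarity between columns (i.e. between documents);
# the optional threshold trades accuracy for speed (DIMSUM)
sims = CoordinateMatrix(entries).toRowMatrix().columnSimilarities(threshold=0.1)

# sims.entries is an RDD of MatrixEntry(doc_i, doc_j, cosine_similarity)
sims.entries.take(5)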

I know this is a broad question, but I am interested in how an expert would approach it. Any direction is appreciated.

1 Answer

Have you tried a cross join of the table with itself? For instance:

from pyspark.sql.functions import col

# table alias can be used to disambiguate identically named columns
df_a = df_original.alias('a')
df_b = df_original.alias('b')

# list all possible combinations,
# then ignore pairs where it's the same row on both sides;
# ordered, so we don't process both (A,B) and (B,A)
df_cross = df_a.crossJoin(df_b).filter('a.id < b.id')

# now apply a udf to compute the similarity for each pair
df_similar = df_cross.withColumn('similarity', similarity_udf(col('a.hash_vector'), col('b.hash_vector')))
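
Here, similarity_udf is whichever measure you want to apply. As a sketch (assuming hash_vector holds pyspark.ml SparseVectors, as in the question), a cosine similarity UDF could look like this; a Jaccard UDF would follow the same pattern:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

@udf(DoubleType())
def similarity_udf(v1, v2):
    # cosine similarity between two vectors; 0.0 if either vector is empty
    norm1, norm2 = v1.norm(2), v2.norm(2)
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return float(v1.dot(v2)) / float(norm1 * norm2)

Keeping only pairs above a certain threshold (idea 3 in the question) is then just a filter on the similarity column of df_similar.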