I'm looking for a way to compute cosine similarity between the rows of a large dataset using PySpark, without going through Pandas.
data is a pivot table built from a DataFrame df that has the columns row_id, feature and score.
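For context, df looks roughly like this (the values below are purely illustrative):

# Illustrative values only; the real df is much larger.
df = spark.createDataFrame(
    [(1, "f1", 0.5), (1, "f2", 0.3), (2, "f1", 0.8), (2, "f3", 0.1)],
    ["row_id", "feature", "score"],
)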
I've calculated it using Pandas and sklearn's cosine_similarity:
from sklearn.metrics.pairwise import cosine_similarity

# Pivot to one row per row_id and one column per feature, then collect to the driver.
data = df.groupBy("row_id").pivot("feature").max("score").fillna(0).toPandas()
# Drop the id column so only the feature columns go into the similarity computation.
cs = cosine_similarity(data.drop(columns="row_id"))
This gives exactly the result I want, an N x N matrix with the similarity between each pair of rows, but it isn't scalable on large datasets: toPandas() collects everything to the driver, so the cluster effectively runs on a single node at low utilization and takes forever to finish this simple step.
Is there a way to compute it using Spark only, so the work uses Spark's distributed capabilities instead of being done outside Spark with Pandas?
The end result would be an N x N Spark DataFrame with the cosine similarity between every pair of rows, based on their feature scores.
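For what it's worth, here is a rough sketch of the kind of computation I have in mind, working directly on the long-format df (row_id, feature, score) with an L2 normalization followed by a self-join on feature. I'm not sure this is the right or most scalable way to do it, which is part of the question:

from pyspark.sql import functions as F

# Sketch only: L2-normalize each row's score vector.
norms = df.groupBy("row_id").agg(F.sqrt(F.sum(F.col("score") ** 2)).alias("norm"))
normalized = df.join(norms, "row_id").withColumn("nscore", F.col("score") / F.col("norm"))

# Self-join on feature so only shared features contribute to each dot product.
left = normalized.select(F.col("row_id").alias("row_a"), "feature", F.col("nscore").alias("score_a"))
right = normalized.select(F.col("row_id").alias("row_b"), "feature", F.col("nscore").alias("score_b"))

# Sum of products of normalized scores = cosine similarity for each (row_a, row_b) pair.
similarity = (
    left.join(right, "feature")
        .groupBy("row_a", "row_b")
        .agg(F.sum(F.col("score_a") * F.col("score_b")).alias("cosine_similarity"))
)

Note that this produces a long-format (row_a, row_b, cosine_similarity) table rather than a literal N x N matrix, and pairs with no overlapping features don't appear at all (their similarity would be 0), so it may still need to be pivoted and filled to match the Pandas output above.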