I'm looking for a way to compute cosine similarity between the rows of a large dataset using PySpark, without going through Pandas.
data is a pivot table built from a DataFrame df that has the columns row_id, feature and score.
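For context, df looks roughly like this (the values below are purely illustrative):

# Illustrative values only; the real df is much larger.
df = spark.createDataFrame(
    [(1, "f1", 0.5), (1, "f2", 0.3), (2, "f1", 0.8), (2, "f3", 0.1)],
    ["row_id", "feature", "score"],
)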
I've calculated it using Pandas and sklearn's cosine_similarity:
from sklearn.metrics.pairwise import cosine_similarity

# Pivot to one row per row_id and one column per feature, then collect to the driver.
data = df.groupBy("row_id").pivot("feature").max("score").fillna(0).toPandas()
# Drop the id column so only the feature columns go into the similarity computation.
cs = cosine_similarity(data.drop(columns="row_id"))
This gives exactly the result I want, an N x N matrix with the similarity between each pair of rows, but it isn't scalable on large datasets: toPandas() collects everything to the driver, so the cluster effectively runs on a single node at low utilization and takes forever to finish this simple step.
Is there a way to compute it using Spark only, so the work uses Spark's distributed capabilities instead of being done outside Spark with Pandas?
The end result would be an N x N Spark DataFrame with the cosine similarity between every pair of rows, based on their feature scores.
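For what it's worth, here is a rough sketch of the kind of computation I have in mind, working directly on the long-format df (row_id, feature, score) with an L2 normalization followed by a self-join on feature. I'm not sure this is the right or most scalable way to do it, which is part of the question:

from pyspark.sql import functions as F

# Sketch only: L2-normalize each row's score vector.
norms = df.groupBy("row_id").agg(F.sqrt(F.sum(F.col("score") ** 2)).alias("norm"))
normalized = df.join(norms, "row_id").withColumn("nscore", F.col("score") / F.col("norm"))

# Self-join on feature so only shared features contribute to each dot product.
left = normalized.select(F.col("row_id").alias("row_a"), "feature", F.col("nscore").alias("score_a"))
right = normalized.select(F.col("row_id").alias("row_b"), "feature", F.col("nscore").alias("score_b"))

# Sum of products of normalized scores = cosine similarity for each (row_a, row_b) pair.
similarity = (
    left.join(right, "feature")
        .groupBy("row_a", "row_b")
        .agg(F.sum(F.col("score_a") * F.col("score_b")).alias("cosine_similarity"))
)

Note that this produces a long-format (row_a, row_b, cosine_similarity) table rather than a literal N x N matrix, and pairs with no overlapping features don't appear at all (their similarity would be 0), so it may still need to be pivoted and filled to match the Pandas output above.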