Item-item recommendation based on cosine similarity

Question

As a part of a recommender system that I am building, I want to implement a item-item recommendation based on cosine similarity. Ideally, I would like to compute the cosine similarity on 1 million items represented by a DenseVector of 2048 features in order to get the top-n most similar items to a given one.

My problem is that the solutions I've come across perform poorly on my dataset.

I've tried :

Calculating the cosine similarity between all the rows of a dataframe in pyspark
Using columnSimilarities() from mllib.linalg.distributed
Reducing dimensionality with PCA

Here is the solution using columnSimilarities()

import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
from pyspark.sql.functions import row_number

new_df = url_rdd.zip(vector_rdd.map(lambda x:Vectors.dense(x))).toDF(schema=['url','features'])

# PCA
pca = PCA(k=1024, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(new_df)
pca_df = pca_model.transform(new_df)

# Indexing my dataframe
pca_df.createOrReplaceTempView('pca_df')
indexed_df = spark.sql('select row_number() over (order by url) - 1 as id, * from pca_df')

# Computing Cosine Similarity
mat = IndexedRowMatrix(indexed_df.select("id", "pca_features").rdd.map(lambda row: IndexedRow(row.id, row.pca_features.toArray()))).toBlockMatrix().transpose().toIndexedRowMatrix()
cos_mat = mat.columnSimilarities()

Is there a better solution on pyspark to compute the cosine similarity and get the top-n most similar items ?

I see you're using a lambda, therefore forcing deserializing into Python objects. That could be a source of the performance problem. Can you provide the full imports in your example, so that we get an actual [mcve](https://stackoverflow.com/help/mcve) and know which parts are from pyspark and which ones aren't? — Oliver W., Apr 18 '19 at 20:29

score 2 · Accepted Answer · answered Apr 19 '19 at 08:49

2

Consider caching new_df, as you're going over it at least twice (once to fit a model, another time to transform the data).

Additionally, don't forget about the optional threshold you can pass to the columnSimilarities method.

answered Apr 19 '19 at 08:49

Oliver W.

13,169
3
37
50

1

Thanks for the tip! I tried using the threshold but it seems that I couldn't pass argument to ```columnSimilarities``` when using IndexedRowMatrix ( but it works with RowMatrix ) – Copp Apr 19 '19 at 10:59
That is true, the threshold is only acceptable if `columnSimilarities` is being used from RowMatrix, not IndexedRowMatrix. – Maziyar Dec 01 '19 at 14:56

Item-item recommendation based on cosine similarity

1 Answers1