0

As a part of a recommender system that I am building, I want to implement a item-item recommendation based on cosine similarity. Ideally, I would like to compute the cosine similarity on 1 million items represented by a DenseVector of 2048 features in order to get the top-n most similar items to a given one.

My problem is that the solutions I've come across perform poorly on my dataset.

I've tried :

Here is the solution using columnSimilarities()

import pyspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix
from pyspark.sql.functions import row_number

new_df = url_rdd.zip(vector_rdd.map(lambda x:Vectors.dense(x))).toDF(schema=['url','features'])

# PCA
pca = PCA(k=1024, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(new_df)
pca_df = pca_model.transform(new_df)

# Indexing my dataframe
pca_df.createOrReplaceTempView('pca_df')
indexed_df = spark.sql('select row_number() over (order by url) - 1 as id, * from pca_df')

# Computing Cosine Similarity
mat = IndexedRowMatrix(indexed_df.select("id", "pca_features").rdd.map(lambda row: IndexedRow(row.id, row.pca_features.toArray()))).toBlockMatrix().transpose().toIndexedRowMatrix()
cos_mat = mat.columnSimilarities()

Is there a better solution on pyspark to compute the cosine similarity and get the top-n most similar items ?

Copp
  • 83
  • 1
  • 12
  • 1
    I see you're using a lambda, therefore forcing deserializing into Python objects. That could be a source of the performance problem. Can you provide the full imports in your example, so that we get an actual [mcve](https://stackoverflow.com/help/mcve) and know which parts are from pyspark and which ones aren't? – Oliver W. Apr 18 '19 at 20:29
  • Thanks for your response Oliver, I edited the code. – Copp Apr 19 '19 at 06:36

1 Answers1

2

Consider caching new_df, as you're going over it at least twice (once to fit a model, another time to transform the data).

Additionally, don't forget about the optional threshold you can pass to the columnSimilarities method.

Oliver W.
  • 13,169
  • 3
  • 37
  • 50
  • 1
    Thanks for the tip! I tried using the threshold but it seems that I couldn't pass argument to ```columnSimilarities``` when using IndexedRowMatrix ( but it works with RowMatrix ) – Copp Apr 19 '19 at 10:59
  • That is true, the threshold is only acceptable if `columnSimilarities` is being used from RowMatrix, not IndexedRowMatrix. – Maziyar Dec 01 '19 at 14:56