
I have a PySpark DataFrame, df1, that looks like:

CustomerID  CustomerValue CustomerValue2 
12          .17           .08

I have a second PySpark DataFrame, df2

 CustomerID  CustomerValue CustomerValue2
 15          .17           .14
 16          .40           .43
 18          .86           .09

I want to take the cosine similarity of the two DataFrames and get something like this:

 CustomerID  CustomerID   CosineCustVal CosineCustVal
 15          12           1            .90
 16          12           .45          .67
 18          12           .8           .04
Nektaria.M
  • Have you tried anything? https://datascience.stackexchange.com/questions/13347/calculate-cosine-similarity-in-apache-spark – Pushkr Sep 27 '18 at 19:08
  • Yes, unfortunately it didn't work. The other solution I thought of was to transform the first df into a list and take the cosine similarity, but I would like to avoid that approach. – Nektaria.M Sep 27 '18 at 19:12
  • Your first dataframe has only one row? Also, which is the common column on which you have decided to join the two because `CustomerID` does not have any common values. – mayank agrawal Sep 28 '18 at 10:15
  • Yes it has only one row. The thing is that I don't want to join the two dfs because indeed they don't have the same CustomerID @mayankagrawal – Nektaria.M Sep 28 '18 at 12:53
  • If it has only a single row, it is best to collect it in list or similar structure and then calculate. No need to create a spark dataframe for a single row. – mayank agrawal Sep 28 '18 at 14:01
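
Since df1 has only one row, a minimal sketch of what the last comment suggests (collect that single row and compare it against df2 with a UDF) could look like the following; the column names are taken from the example above, and `cosine`/`cosine_udf`/`result` are names introduced here for illustration only:

import math

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Collect the single df1 row into a plain Python list.
ref = df1.select('CustomerValue', 'CustomerValue2').first()
ref_vec = [ref[0], ref[1]]

def cosine(v1, v2):
    # Plain cosine similarity between two equal-length lists.
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Compare every row of df2 against the collected reference vector.
cosine_udf = F.udf(lambda a, b: cosine(ref_vec, [a, b]), DoubleType())
result = df2.select('CustomerID',
                    cosine_udf('CustomerValue', 'CustomerValue2').alias('cosine_similarity'))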

1 Answer


You can only calculate cosine similarity between two vectors, not between two single numbers. That said, if the CustomerValue columns are the components of a vector representing the feature you want to compare between two customers, you can do it by transposing the data frames and then joining on the CustomerValue columns.

The transposition can be done with an explode (more details about transposing a data frame here):

from pyspark.sql import functions as F

# Build one struct per value column (its name and its value) and explode
# into one row per (CustomerID, component) pair.
kvs = F.explode(F.array([
        F.struct(F.lit(c).alias('name'), F.col(c).alias('value'))
        for c in ['CustomerValue', 'CustomerValue2']
      ])).alias('kvs')

dft1 = (df1.select('CustomerID', kvs)
        .select('CustomerID', F.col('kvs.name').alias('column_name'), F.col('kvs.value').alias('column_value'))
        )
dft2 = (df2.select('CustomerID', kvs)
        .select('CustomerID', F.col('kvs.name').alias('column_name'), F.col('kvs.value').alias('column_value'))
        )
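
For the example df1 above, the transposed frame ends up with one row per value column. Roughly (a sketch of the expected output, using the aliases from the snippet above):

dft1.show()
# Roughly, for the example df1:
# +----------+--------------+------------+
# |CustomerID|   column_name|column_value|
# +----------+--------------+------------+
# |        12| CustomerValue|        0.17|
# |        12|CustomerValue2|        0.08|
# +----------+--------------+------------+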

where dft1 and dft2 denote the transposed data frames. Once you have transposed them, you can join them on the column names:

# Rename the columns of the second transposed frame so they do not clash after the join.
dft2 = (dft2.withColumnRenamed('CustomerID', 'CustomerID2')
        .withColumnRenamed('column_value', 'column_value2')
       )

# Join on the component name and sum the products of the components for every customer pair.
cosine = (dft1.join(dft2, dft1.column_name == dft2.column_name)
          .groupBy('CustomerID', 'CustomerID2')
          .agg(F.sum(F.col('column_value') * F.col('column_value2')).alias('cosine_similarity'))
         )

Now cosine has three columns: the CustomerID from the first data frame, the CustomerID from the second data frame, and the cosine similarity (provided the values were normalized first). An added advantage is that you only get rows for CustomerID pairs with a nonzero similarity (when some CustomerIDs have zero values). For your example:

df1:

CustomerID CustomerValue CustomerValue2
12         .17           .08

df2:

CustomerID CustomerValue CustomerValue2
15         .17           .14
16         .40           .43
18         .86           .09

cosine:

CustomerID CustomerID2 cosine_similarity
12         15          .0401
12         16          .1024
12         18          .1534

Of course these are not the real cosine similarities yet; you need to normalize the values first. You can do that by computing each customer's norm with a group by and joining it back onto the transposed data frame (here df stands for either dft1 or dft2):

# Compute the L2 norm per customer, then divide each component by it.
norms = (df.groupBy('CustomerID')
         .agg(F.sqrt(F.sum(F.col('column_value') * F.col('column_value'))).alias('norm'))
        )
df_normalized = (df.join(norms, on='CustomerID')
                 .select('CustomerID', 'column_name',
                         (F.col('column_value') / F.col('norm')).alias('column_value'))
                )
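
Wrapping that recipe in a small helper and applying it to both transposed frames before the join gives the normalized similarities directly. This is a sketch; normalize, dft1_n, dft2_n and cosine_norm are names introduced here for illustration, not part of the original code:

from pyspark.sql import functions as F

def normalize(dft):
    # Divide each customer's component values by that customer's L2 norm.
    norms = (dft.groupBy('CustomerID')
             .agg(F.sqrt(F.sum(F.col('column_value') * F.col('column_value'))).alias('norm'))
            )
    return (dft.join(norms, on='CustomerID')
            .select('CustomerID', 'column_name',
                    (F.col('column_value') / F.col('norm')).alias('column_value'))
           )

dft1_n = normalize(dft1)
dft2_n = (normalize(dft2)
          .withColumnRenamed('CustomerID', 'CustomerID2')
          .withColumnRenamed('column_value', 'column_value2')
         )

cosine_norm = (dft1_n.join(dft2_n, dft1_n.column_name == dft2_n.column_name)
               .groupBy('CustomerID', 'CustomerID2')
               .agg(F.sum(F.col('column_value') * F.col('column_value2')).alias('cosine_similarity'))
              )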

After normalizing the columns your cosine similarities become the following:

CustomerID CustomerID2 cosine_similarity
12         15          .970
12         16          .928
12         18          .945

The large similarity values are due to the low dimensionality (two components only).
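
As a quick sanity check outside Spark, the 12 vs 15 pair from the example can be recomputed in plain Python:

import math

a = [0.17, 0.08]  # CustomerID 12
b = [0.17, 0.14]  # CustomerID 15

dot = sum(x * y for x, y in zip(a, b))  # 0.0401, as in the unnormalized table
cos = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
print(round(cos, 2))  # 0.97, matching the normalized table above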

enys