I am using Python & Spark to solve a problem. I have a Spark DataFrame with two columns, each containing a scalar numeric value (e.g. double or float). I want to interpret these two columns as vectors and calculate the cosine similarity between them. So far I have only found Spark linear algebra routines that operate on a DenseVector stored inside a single DataFrame cell.
Code sample in NumPy:
import numpy as np
from numpy.linalg import norm
vec = np.array([1, 2])
vec_2 = np.array([2, 1])
cos_sim = np.dot(vec, vec_2) / (norm(vec) * norm(vec_2))
print(cos_sim)
The result should be 0.8.
How can I do this in Spark?
df_small = spark.createDataFrame([(1, 2), (2, 1)])
df_small.show()
Is there a way to convert a column of double values to a DenseVector? Do you see any other solution to my problem?