I am using Python & Spark to solve a problem. I have a Spark DataFrame with two columns, each containing a scalar numeric value (e.g. double or float). I want to interpret these two columns as vectors and calculate the cosine similarity between them. So far I have only found Spark linear algebra routines that operate on a DenseVector stored inside a single DataFrame cell.
Code sample in NumPy:
import numpy as np
from numpy.linalg import norm
vec = np.array([1, 2])
vec_2 = np.array([2, 1])
cos_sim = np.dot(vec, vec_2) / (norm(vec) * norm(vec_2))
print(cos_sim)
The result should be 0.8.
How can I do this in Spark?
df_small = spark.createDataFrame([(1, 2), (2, 1)])
df_small.show()
Is there a way to convert a column of double values to a DenseVector? Do you see any other solution to my problem?