How can I write a pure NumPy function that returns an array of cosine similarities for all pairwise comparisons between the rows of two input arrays?

I don't want to return a single value.

dataSet1 = [5, 6, 7, 2]
dataSet2 = [2, 3, 1, 15]

def cosine_similarity(list1, list2):
  # How to?
  pass

print(cosine_similarity(dataSet1, dataSet2))
Henry Ecker

3 Answers

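You can use SciPy for this, as stated in this answer. spatial.distance.cosine returns the cosine distance, so subtracting it from 1 gives the cosine similarity.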

from scipy import spatial

dataSet1 = [5, 6, 7, 2]
dataSet2 = [2, 3, 1, 15]
result = 1 - spatial.distance.cosine(dataSet1, dataSet2)
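
Since the question asks for pure NumPy, the same quantity can also be computed without SciPy. The following is a minimal sketch, assuming both inputs are non-zero 1-D vectors of equal length; the function name `cosine_similarity_1d` is made up for illustration and is not part of any library.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine similarity of two 1-D vectors (assumes neither is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the Euclidean norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dataSet1 = [5, 6, 7, 2]
dataSet2 = [2, 3, 1, 15]
result = cosine_similarity_1d(dataSet1, dataSet2)
print(result)  # roughly 0.394
```

A zero vector would cause a division by zero here, so real inputs may need a guard for that case.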
Helge Schneider

You can also use the cosine_similarity function from sklearn.

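import numpy as np
from sklearn.feature_extraction.text import CountVectorizer  # only needed if the documents are text
from sklearn.metrics.pairwise import cosine_similarity

def cos(docs):
    # A single document has nothing to be compared against.
    if len(docs) == 1:
        return []

    # The original passed tokenizer=tokenize, but tokenize is not defined
    # in this snippet, so the default tokenizer is used here instead.
    count_vectorizer = CountVectorizer()
    # Replace missing entries (NaN) with a placeholder token.
    doc1 = ['missing' if isinstance(x, float) and np.isnan(x) else x for x in docs]
    count_vec = count_vectorizer.fit_transform(doc1)
    # Pairwise cosine similarity between all documents, shape (n_docs, n_docs).
    cosine_sim_matrix = cosine_similarity(count_vec)
    return cosine_sim_matrix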
Will A

What you are looking for is cosine_similarity from the sklearn library.

Here is a simple example:

Suppose x holds three 5-dimensional vectors and y holds a single 5-dimensional vector. We can compute the cosine similarity as follows:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

x = np.random.rand(3,5)
y = np.random.rand(1,5)

# >>> x
# array([[0.21668023, 0.05705532, 0.6391782 , 0.97990692, 0.90601101],
#        [0.82725409, 0.30221347, 0.98101159, 0.13982621, 0.88490538],
#        [0.09895812, 0.19948788, 0.12710054, 0.61409403, 0.56001643]])
# >>> y
# array([[0.70531146, 0.10222257, 0.6027328 , 0.87662291, 0.27053804]])

cosine_similarity(x, y)

The output is the cosine similarity of each of the 3 vectors in x with the single vector in y, so it has shape 3x1:

array([[0.84139047],
       [0.75146312],
       [0.75255157]])
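
The same pairwise result can be obtained with NumPy alone, which is closer to what the question asks for. This is a minimal sketch, assuming every row is non-zero and both arrays have the same number of columns; the name `pairwise_cosine_similarity` is hypothetical. It normalizes each row to unit length and then takes a matrix product, so entry (i, j) is the similarity between row i of x and row j of y.

```python
import numpy as np

def pairwise_cosine_similarity(x, y):
    """Return an (n_x, n_y) array of cosine similarities between rows of x and y."""
    # Promote 1-D inputs to a single-row 2-D array.
    x = np.atleast_2d(np.asarray(x, dtype=float))
    y = np.atleast_2d(np.asarray(y, dtype=float))
    # Scale every row to unit Euclidean length (assumes no all-zero rows).
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    # Dot products of unit vectors are the cosine similarities.
    return x_unit @ y_unit.T

# Same values as the x and y printed above.
x = np.array([[0.21668023, 0.05705532, 0.6391782, 0.97990692, 0.90601101],
              [0.82725409, 0.30221347, 0.98101159, 0.13982621, 0.88490538],
              [0.09895812, 0.19948788, 0.12710054, 0.61409403, 0.56001643]])
y = np.array([[0.70531146, 0.10222257, 0.6027328, 0.87662291, 0.27053804]])

sims = pairwise_cosine_similarity(x, y)
print(sims.shape)  # (3, 1)
print(sims)
```

With these inputs the values should agree with the sklearn output above up to rounding. Passing the two lists from the question gives a 1x1 array, because each list is treated as a single row.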
Ersel Er