
In a blog post I read that the following "naive implementation" of cosine similarity should never be used in production. The post didn't explain why, and I'm really curious: can anyone give an explanation?

import numpy as np

def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according 
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

# the counts we computed above
sentence_m = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0]) 
sentence_h = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0])
sentence_w = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1])

# We should expect sentence_m and sentence_h to be more similar
print(cos_sim(sentence_m, sentence_h)) # 0.5
print(cos_sim(sentence_m, sentence_w)) # 0.25
Soroush
liyuan
  • I think the author of the blog meant "naive" for how he represents sentences as vectors (i.e. only counting occurrences), not for how cos_sim is computed. – T.Lucas Dec 14 '18 at 08:02
  • That statement by the author is ambiguous, and overall this post has little pedagogical value. –  Dec 14 '18 at 09:37

3 Answers


The function cos_sim itself is what it should be. The problem is representing the sentences with raw counts; consider using tf-idf weights instead.
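
For example, a minimal sketch with scikit-learn's TfidfVectorizer (the sentences below are placeholders for illustration, not the ones from the blog post):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder sentences, purely for illustration
sentences = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply today",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(sentences)  # sparse matrix, one row per sentence

# Pairwise cosine similarities between the tf-idf rows
print(cosine_similarity(tfidf))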

Soroush

There are a few good reasons to avoid this particular implementation.

The main one for me is that there's no check for zero vectors. If either vector is all zeros, the denominator is zero: NumPy will emit a runtime warning and return nan rather than a usable similarity.
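
One defensive variant might look like this (a sketch; returning 0.0 for a zero vector is just one convention, not the only reasonable choice):

import numpy as np

def safe_cos_sim(a, b):
    """Cosine similarity that returns 0.0 when either vector is all zeros,
    instead of producing nan from a 0/0 division."""
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return np.dot(a, b) / (norm_a * norm_b)

print(safe_cos_sim(np.zeros(3), np.array([1.0, 2.0, 3.0])))  # 0.0, no warning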

Another reason is performance. Computing the norms requires a square root, which can be fairly expensive. If you know you'll compute cos_sim many times, it can be worth normalizing the vectors once up front and then using a plain dot product.
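
A rough sketch of that idea (the data here is random, just for illustration, and it assumes no zero rows):

import numpy as np

# Hypothetical batch of row vectors; in practice this would be your data
vectors = np.random.rand(1000, 64)

# Normalize once: divide each row by its L2 norm
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Every pairwise cosine similarity is now just a dot product,
# e.g. the full similarity matrix in one matrix multiply:
sims = unit @ unit.T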

The last reason is that there may be dedicated hardware or vectorized library support for this operation that your own implementation likely won't take advantage of, just as np.dot and np.linalg.norm already give you some benefit over writing the loops yourself.

In general it's a good idea to use a well-tested and well-supported library, unless you want to understand what happens under the hood (as in the blog post's example) or you really know what you are doing.

This question has a few suggestions on library functions that compute cosine similarity and likely address the issues mentioned above: Cosine Similarity between 2 Number Lists
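
For instance, SciPy ships one; note that it returns the cosine *distance*, i.e. one minus the similarity:

from scipy.spatial.distance import cosine

a = [1, 1, 1, 1, 0, 0, 0, 0, 0]
b = [0, 0, 1, 1, 1, 1, 0, 0, 0]

# scipy.spatial.distance.cosine returns the cosine distance
print(1 - cosine(a, b))  # 0.5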

Sorin

The word "naive" is used specifically for the implementation described in the blog (or at least I would like to hope so). There is nothing wrong with using cosine similarity; in fact, it is a statistically explainable and practically proven way of measuring the similarity of text. Combine it with modern methods such as embeddings and you have a much more robust similarity framework.

"Naive" here refers to using just word counts or occurrences to calculate similarity.
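
A toy sketch of why that matters: the "embeddings" below are made-up three-dimensional vectors purely for illustration, whereas a real system would use pretrained ones (word2vec, GloVe, sentence transformers, etc.). Count vectors score words that never co-occur as 0, while embeddings can still capture their similarity:

import numpy as np

# Hand-made toy "embeddings", purely for illustration
embeddings = {
    "dog":   np.array([0.9, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.2, 0.1]),
    "car":   np.array([0.0, 0.1, 0.9]),
}

def sentence_vector(words):
    """Average the word vectors; one simple way to embed a sentence."""
    return np.mean([embeddings[w] for w in words], axis=0)

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Count vectors would give "dog" vs "puppy" a similarity of 0,
# but their embeddings are close:
print(cos_sim(sentence_vector(["dog"]), sentence_vector(["puppy"])))  # high (~0.98)
print(cos_sim(sentence_vector(["dog"]), sentence_vector(["car"])))    # low  (~0.01)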

rishi