19

It looks like the cosine distance in scipy.spatial.distance.cdist:

link to cos distance 1

1 - u*v/(||u||||v||)

is different from sklearn.metrics.pairwise.cosine_similarity which is

link to cos similarity 2

 u*v/(||u||||v||)

Does anybody know the reason for the different definitions?

seralouk
user1700890
  • The link that you labeled "link to cos similarity 1" is *not* cosine similarity, and it is not called that in the link. It is cosine distance. – Warren Weckesser Oct 14 '19 at 23:43
  • Think of the trivial case: *distance(X, X)* should be 0, because the distance from *X* to *X* is 0. *similarity(X, X)* should be the *maximum* of the function that measures similarity (1 in this case), because *X* and *X* are as similar as two things can be. – Warren Weckesser Oct 14 '19 at 23:46
  • @WarrenWeckesser, thank you, I fixed the name. – user1700890 Oct 15 '19 at 15:42

1 Answer

34

Good question. Yes, these are two different quantities, but they are connected by the following equation:

Cosine_distance = 1 - cosine_similarity
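
The relation can be checked numerically. The following is a minimal sketch, assuming NumPy, SciPy, and scikit-learn are installed; the arrays `A` and `B` are made-up illustrative data, not taken from the question.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative data only: 3 row vectors in A, 2 row vectors in B (no zero vectors)
A = np.array([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [-1.0, 0.5, 2.0]])
B = np.array([[2.0, 4.0, 6.0], [1.0, 0.0, -1.0]])

dist = cdist(A, B, metric="cosine")  # pairwise cosine distance, shape (3, 2)
sim = cosine_similarity(A, B)        # pairwise cosine similarity, shape (3, 2)

# Compare element-wise, up to floating-point tolerance
print(np.allclose(dist, 1.0 - sim))
```

For this data the check should print `True`: both functions compute the same cosine, and SciPy simply returns one minus it.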


Why?

Cosine similarity is commonly used as a similarity measure between vectors. The cosine distance is then defined as 1 - cosine_similarity.

The intuition behind this is that if two vectors point in exactly the same direction, then the similarity is 1 (angle = 0) and thus the distance is 0 (1 - 1 = 0).

The range of the cosine distance follows from the range of the similarity in the same way: a similarity in [-1, 1] gives a distance in [0, 2].

Cosine similarity range: −1 meaning exactly opposite directions, 1 meaning exactly the same direction, 0 indicating orthogonality.
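
To make the three reference points concrete, here is a small pure-Python sketch of both formulas. The helper names `cos_sim` and `cos_dist` are hypothetical (they are not the SciPy or scikit-learn functions), and the vectors are chosen only for illustration.

```python
import math

def cos_sim(u, v):
    """Hypothetical helper: u.v / (||u|| ||v||) for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cos_dist(u, v):
    """Hypothetical helper: cosine distance defined as 1 - cosine similarity."""
    return 1.0 - cos_sim(u, v)

u = [3.0, 4.0]
cases = {
    "same direction": [6.0, 8.0],   # positive multiple of u
    "orthogonal": [-4.0, 3.0],      # dot product with u is 0
    "opposite": [-3.0, -4.0],       # negative multiple of u
}
for name, v in cases.items():
    print(name, cos_sim(u, v), cos_dist(u, v))
```

The similarity comes out as 1, 0, and −1 for the three cases, so the distance is 0, 1, and 2. Note that `[6.0, 8.0]` is not equal to `u`, yet its distance to `u` is 0: only the direction matters, not the length.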


References: Scipy wolfram


seralouk
  • Thank you for explanation. Terminology a bit confusing. I feel like cosine distance should be called simply cosine. Cosine similarity distance should be called cosine distance. – user1700890 Oct 14 '19 at 18:34
  • I agree but this is how it is defined in the engineering/math community. – seralouk Oct 14 '19 at 18:36
  • Yeah, does not make sense to change it now. – user1700890 Oct 14 '19 at 18:37
  • @user1700890 see the first bullet point [here](https://en.wikipedia.org/wiki/Distance#General_metric): for something to be a *distance* it must satisfy *"d(x,y) = 0 if and only if x = y, i.e. is zero precisely from a point to itself"*. The cosine *distance* satisfies this, cosine *similarity* does not. Hence the terminology. – Dan Oct 15 '19 at 15:50
  • @Dan Thank you Dan. Your explanation makes sense. Interesting how `cosine_similarity` is under `sklearn.metrics` while not being a metric – user1700890 Oct 15 '19 at 15:56
  • Take a look at the second sentence in [this article](https://en.wikipedia.org/wiki/Similarity_measure): while not strictly a mathematical metric, in stats similarities are colloquially referred to as metrics as they fill similar roles. sklearn's metrics are more like measurements (colloquially). – Dan Oct 15 '19 at 16:05