So I have made an AnnoyIndexer and am running some most_similar queries to find the nearest neighbours of some vectors in a 300-dimensional vector space. This is the code for it:

def most_similar(self, vector, num_neighbors):
    """Find the approximate `num_neighbors` most similar items.
    Parameters
    ----------
    vector : numpy.array
        Vector for word/document.
    num_neighbors : int
        Number of most similar items
    Returns
    -------
    list of (str, float)
        List of most similar items in format [(`item`, `cosine_distance`), ... ]
    """

    ids, distances = self.index.get_nns_by_vector(
        vector, num_neighbors, include_distances=True)

    return [(self.labels[ids[i]], 1 - distances[i] / 2) for i in range(len(ids))]

I am wondering why the returned distances are all divided by 2 and then subtracted from 1. Surely after doing that, the largest and smallest distances are all mixed up?

ellie123

1 Answer

From the documentation of gensim:

"List of most similar items in format [(`item`, `cosine_distance`), ...]"

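The distances returned by the AnnoyIndex are Euclidean distances between the normalized vectors (assuming the index uses Annoy's angular metric), so the method has to transform them into a cosine-based score. For unit vectors, e² = 2(1 - cos θ), where e is the Euclidean distance, so the cosine similarity is exactly 1 - e²/2; the 1 - e/2 used in the code is a simpler rescaling of e from [0, 2] to [0, 1] rather than the exact cosine value. Strictly speaking the result is a similarity (larger means closer), even though the docstring labels it `cosine_distance`. See this for a derivation of the equivalence.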
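As a quick numerical check, here is a minimal sketch of how the Euclidean distance between unit-normalized vectors relates to cosine similarity. It assumes the index uses Annoy's angular metric, which reports sqrt(2(1 - cos θ)); the helper names are made up for illustration and are not part of gensim or Annoy.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Plain cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_between_unit_vectors(u, v):
    """Euclidean distance after normalizing both vectors (assumed angular metric)."""
    u, v = normalize(u), normalize(v)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


random.seed(0)
u = [random.gauss(0, 1) for _ in range(300)]
v = [random.gauss(0, 1) for _ in range(300)]

e = euclidean_between_unit_vectors(u, v)
cos = cosine_similarity(u, v)

exact = 1 - e ** 2 / 2  # matches the cosine similarity up to rounding
rescaled = 1 - e / 2    # the expression used in the quoted most_similar

print(cos, exact, rescaled)
```

The first two printed values should agree to floating-point precision, while the third is a different number that still lies in [0, 1].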

Also note that this transformation is strictly decreasing, so it flips the numerical order of the values but keeps the ranking of the items intact. Consider 0 ≤ d1 < d2 ≤ 2: then d1/2 < d2/2 and therefore 1 - d1/2 > 1 - d2/2. So if d1 is the distance of o1 and d2 the distance of o2, then after the transformation o1 has the higher score and still ranks closer to the query vector than o2.
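The ranking argument can be checked with a few made-up distances. This is only an illustrative sketch; the values are arbitrary numbers in [0, 2], not output from a real index.

```python
# Hypothetical distances, nearest first, as an index would return them.
distances = [0.0, 0.35, 0.9, 1.4, 2.0]

# The transformation from the quoted most_similar method.
scores = [1 - d / 2 for d in distances]

# Rank items by ascending distance and by descending score.
rank_by_distance = sorted(range(len(distances)), key=lambda i: distances[i])
rank_by_score = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

print(rank_by_distance == rank_by_score)
```

Smaller distances map to larger scores, so sorting by score in descending order gives the same neighbour ranking as sorting by distance in ascending order.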

Dani Mesejo
  • Thank you for your answer, that has really helped. One thing I'm not understanding is that if it is returning the cosine distance between the vectors, why does it give different results to the most_similar method in the WordEmbeddingsKeyedVectors class that uses the COSADD method? Doesn't the COSADD method also just find the cosine distance? – ellie123 Aug 18 '18 at 10:37
  • The function most_similar from WordEmbeddingsKeyedVectors returns the cosine similarity, and cosine distance = 1 - cosine similarity. See the Wikipedia entry on [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) for more on the relationship between Euclidean distance, cosine distance and cosine similarity. – Dani Mesejo Aug 18 '18 at 20:25
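To make the similarity/distance distinction from the comment concrete, here is a small sketch in plain Python. The function names are illustrative only and do not come from gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity, in [0, 2]."""
    return 1 - cosine_similarity(u, v)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way have similarity 1 and distance 0, so a method that reports similarity and one that reports distance will give different numbers for the same pair.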