I have built an AnnoyIndexer and am running some most_similar queries to find the nearest neighbours of some vectors in a 300-dimensional vector space. This is the code for most_similar:
    def most_similar(self, vector, num_neighbors):
        """Find the approximate `num_neighbors` most similar items.

        Parameters
        ----------
        vector : numpy.array
            Vector for word/document.
        num_neighbors : int
            Number of most similar items

        Returns
        -------
        list of (str, float)
            List of most similar items in format [(`item`, `cosine_distance`), ... ]

        """
        ids, distances = self.index.get_nns_by_vector(
            vector, num_neighbors, include_distances=True)
        return [(self.labels[ids[i]], 1 - distances[i] / 2) for i in range(len(ids))]
I am wondering why each returned distance is divided by 2 and then subtracted from 1. Surely after doing that, the ordering of largest/smallest distances is all messed up?
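To make the ordering concern concrete, here is a small sketch of what that expression does to a few sample values. It assumes the index was built with Annoy's `angular` metric, which the Annoy README describes as sqrt(2(1 - cos(u, v))), so distances fall between 0 and 2. The helper names `annoy_angular_distance` and `to_score` are made up for illustration and are not part of any library. Note that Python evaluates `1 - d / 2` as `1 - (d / 2)`, because division binds tighter than subtraction.

```python
import math

def annoy_angular_distance(cos_sim):
    # Assumption: Annoy's "angular" metric, documented as sqrt(2 * (1 - cos)).
    return math.sqrt(2.0 * (1.0 - cos_sim))

def to_score(distance):
    # Same arithmetic as the return line of most_similar:
    # halve the distance first, then subtract the result from 1.
    return 1 - distance / 2

# Sample cosine similarities, from identical direction to opposite direction.
cosines = [1.0, 0.5, 0.0, -1.0]
distances = [annoy_angular_distance(c) for c in cosines]
scores = [to_score(d) for d in distances]

for c, d, s in zip(cosines, distances, scores):
    print(f"cos={c:+.1f}  distance={d:.3f}  score={s:.3f}")
```

Under that assumption the mapping is strictly decreasing: a distance of 0 gives a score of 1 and a distance of 2 gives a score of 0, so the ranking is reversed (nearest items score highest) but not scrambled. The score is not the cosine itself, though; with this metric the cosine would be `1 - d**2 / 2`.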