I am trying to use the Annoy library in Python. I have the following code, which runs fine:
from annoy import AnnoyIndex


class GeneratorAnnoy:
    def __init__(self, vector_dict: dict):
        self.vector_dict = vector_dict
        self.vector_list = list(self.vector_dict.values())
        self.vector_length = len(self.vector_list[0])
        # Generate conversion tables
        self.obj_id_to_index = {}
        self.index_to_obj_id = {}
        for index, obj_id in enumerate(list(self.vector_dict.keys())):
            self.obj_id_to_index[obj_id] = index
            self.index_to_obj_id[index] = obj_id

    def build_forest(self):
        # Init list of item vectors
        self.annoy_index = AnnoyIndex(self.vector_length, 'euclidean')
        for i, vector in enumerate(self.vector_list):
            self.annoy_index.add_item(i, vector)
        # Build forest of TREE_COUNT trees
        self.annoy_index.build(20)

    def compute_distances(self, obj_id: int, count: int):
        """
        Compute nearest neighbours given an obj_id.
        :param int obj_id: the id of the object to which the distances should be computed.
        :return list indices: the indices of the nearest neighbours.
        :return list distances: the distances to the nearest neighbours.
        """
        index = self.obj_id_to_index[obj_id]
        result = self.annoy_index.get_nns_by_item(index, count, include_distances=True)
        return result

    def compute_similarities(self, count: int):
        similarities = {}
        for obj_id in self.vector_dict.keys():
            indices, distances = self.compute_distances(obj_id, count)
            # Create similarities dict
            # Template: dict[base_id][recommendation_id] = distance
            similarities[obj_id] = {}
            for i in range(count):
                index = indices[i]
                similarities[obj_id][self.index_to_obj_id[index]] = distances[i]
        return similarities

vector_dict = {48: [0.0, 1.0], 55: [0.0, -1.0]}
count = len(vector_dict)
generator = GeneratorAnnoy(vector_dict)
generator.build_forest()
similarities = generator.compute_similarities(count)
print('similarities = ', similarities)
Result:
{48: {48: 0.0, 55: 2.0}, 55: {55: 0.0, 48: 2.0}}
However, while looking into the source code of Annoy (https://github.com/spotify/annoy/blob/master/src/annoylib.h), I found that there are more metrics available than just euclidean. Since I want to apply Annoy to high-dimensional vectors (approximately 300 dimensions), I read online that angular is the better metric to use. The only problem is that when I change the index to
self.annoy_index = AnnoyIndex(self.vector_length, 'angular')
it returns the same result, even though I expected it to return 1.57 (0.5*pi). Why is this the case, and how can I fix it? While looking into the source code I also saw that results can be normalised. How can I trigger this in the code above?
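
For reference, here is a minimal sketch of how the angular result could be turned into an actual angle. It assumes that Annoy's 'angular' metric is not the angle itself but, as described in the Annoy README, the Euclidean distance between the normalised vectors, i.e. sqrt(2 * (1 - cos(u, v))). The helper name annoy_angular_to_radians is hypothetical and not part of Annoy. Under that assumption, the two opposite vectors above have cos = -1, which gives a distance of 2.0 and an angle of pi, so the output would coincide with the euclidean one for this particular example.

```python
import math


def annoy_angular_to_radians(distance: float) -> float:
    """Convert an Annoy 'angular' distance to an angle in radians.

    Assumption: distance == sqrt(2 * (1 - cos(u, v))), as stated in the
    Annoy README. This helper is illustrative and not part of Annoy.
    """
    # Invert the assumed formula to recover the cosine similarity.
    cosine = 1.0 - (distance ** 2) / 2.0
    # Clamp to guard against floating-point drift just outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.acos(cosine)


# Opposite vectors: Annoy distance 2.0, angle of approximately pi.
print(annoy_angular_to_radians(2.0))
# Orthogonal vectors: Annoy distance sqrt(2), angle of approximately pi / 2.
print(annoy_angular_to_radians(math.sqrt(2)))
```

The conversion would be applied to each value in the distances list returned by get_nns_by_item(..., include_distances=True) when the index is built with 'angular'.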