I am trying to use the Annoy library in Python. I have the following code, which runs fine:

from annoy import AnnoyIndex


class GeneratorAnnoy:

    def __init__(self, vector_dict: dict):
        self.vector_dict = vector_dict
        self.vector_list = list(self.vector_dict.values())
        self.vector_length = len(self.vector_list[0])

        # Generate conversion tables
        self.obj_id_to_index = {}
        self.index_to_obj_id = {}
        for index, obj_id in enumerate(self.vector_dict):
            self.obj_id_to_index[obj_id] = index
            self.index_to_obj_id[index] = obj_id

    def build_forest(self):
        # Create the index and add every vector under its integer position
        self.annoy_index = AnnoyIndex(self.vector_length, 'euclidean')
        for i, vector in enumerate(self.vector_list):
            self.annoy_index.add_item(i, vector)

        # Build a forest of 20 trees
        self.annoy_index.build(20)

    def compute_distances(self, obj_id: int, count: int):
        """
        Compute nearest neighbours given an obj_id.

        :param int obj_id: the id of the object to which the distances should be computed.
        :param int count: the number of nearest neighbours to return.

        :return list indices: the indices of the nearest neighbours.
        :return list distances: the distances to the nearest neighbours.
        """

        index = self.obj_id_to_index[obj_id]
        result = self.annoy_index.get_nns_by_item(index, count, include_distances=True)
        return result

    def compute_similarities(self, count: int):
        similarities = {}
        for obj_id in self.vector_dict.keys():
            indices, distances = self.compute_distances(obj_id, count)
            
            # Create similarities dict
            # Template: dict[base_id][recommendation_id] = distance
            similarities[obj_id] = {}
            for index, distance in zip(indices, distances):
                similarities[obj_id][self.index_to_obj_id[index]] = distance

        return similarities
        
vector_dict = {48: [0.0, 1.0], 55: [0.0, -1.0]}

count = len(vector_dict)

generator = GeneratorAnnoy(vector_dict)
generator.build_forest()
similarities = generator.compute_similarities(count)

print('similarities = ', similarities)

Result:

{48: {48: 0.0, 55: 2.0}, 55: {55: 0.0, 48: 2.0}}

However, while looking into the Annoy source code (https://github.com/spotify/annoy/blob/master/src/annoylib.h) I found that there are more metrics than just euclidean. Since I want to apply Annoy to high-dimensional vectors (approximately 300 dimensions), I read online that angular is the better metric to use. The only problem is that when I change the line to self.annoy_index = AnnoyIndex(self.vector_length, 'angular'), it returns the same result, even though I expected it to return 1.57 (0.5*pi). Why is this the case, and how can I fix it? While looking into the source code I also saw that results can be normalised. How can I trigger this in the code above?
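
For reference, here is a minimal sketch of what I understand the angular metric to compute. It assumes, based on the Annoy README, that the 'angular' distance is the Euclidean distance between the normalised vectors, i.e. sqrt(2 * (1 - cos(u, v))), and not the angle itself. The helper names annoy_style_angular_distance and angle_from_angular_distance are my own and are not part of the Annoy API.

```python
import math


def annoy_style_angular_distance(u, v):
    """sqrt(2 * (1 - cos(u, v))): Euclidean distance between the normalised vectors.

    Assumption: this mirrors the formula quoted in the Annoy README.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_uv = dot / (norm_u * norm_v)
    # max() guards against a tiny negative value caused by rounding
    return math.sqrt(max(0.0, 2.0 * (1.0 - cos_uv)))


def angle_from_angular_distance(distance):
    """Invert the formula above to recover the angle in radians."""
    cos_uv = 1.0 - distance ** 2 / 2.0
    # Clamp to guard against floating-point drift outside [-1, 1]
    return math.acos(max(-1.0, min(1.0, cos_uv)))


# The two vectors from the question point in opposite directions
opposite = annoy_style_angular_distance([0.0, 1.0], [0.0, -1.0])
# A pair of orthogonal vectors, for comparison
orthogonal = annoy_style_angular_distance([0.0, 1.0], [1.0, 0.0])

print(opposite, angle_from_angular_distance(opposite))
print(orthogonal, angle_from_angular_distance(orthogonal))
```

Under that assumption, the two unit vectors above give 2.0 for both metrics, because they are already normalised and point in opposite directions (an angle of pi, not 0.5*pi). An orthogonal pair would give sqrt(2), which converts back to an angle of 0.5*pi.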
