
I am doing research on approximate nearest neighbor algorithms.

I recently found the Annoy library, which does an amazing job of finding k-nearest neighbors at reasonable speed. For a deeper analysis, you can skim the meetup slides.
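
For context, this is roughly how I am using it through its Python bindings (the `annoy` package); the dimensionality, metric, and tree count below are arbitrary choices on my part:

```python
# Minimal sketch of building and querying an Annoy index (pip install annoy).
import random
from annoy import AnnoyIndex

f = 40                                # vector dimensionality
index = AnnoyIndex(f, 'angular')      # other metrics such as 'euclidean' also exist

for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])

index.build(10)                       # 10 trees; more trees -> higher precision, slower build
print(index.get_nns_by_item(0, 10))   # 10 approximate nearest neighbors of item 0
```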

After reading the slides and inspecting the source code, I cannot see why this algorithm handles high-dimensional data better than a KD-Tree.

The KD-Tree is a great NN algorithm; however, in high dimensions its query time becomes almost identical to a brute-force linear scan (O(n) per query), because it has to check many neighboring leaves.
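
To see the degradation concretely, here is a quick sketch using scikit-learn's `KDTree` on synthetic data (the sizes and dimensions are arbitrary choices of mine):

```python
# Sketch: compare KDTree query time in low vs. high dimensions on random data.
# In high dimensions the query time approaches that of a linear scan.
import time
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)

for d in (3, 100):
    X = rng.standard_normal((50_000, d))
    tree = KDTree(X)
    queries = rng.standard_normal((100, d))

    start = time.perf_counter()
    tree.query(queries, k=10)
    print(f"d={d}: {time.perf_counter() - start:.3f} s for 100 queries")
```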

The reason is that in high dimensions the volume of the unit sphere becomes vanishingly small relative to the enclosing cube (you can take a look here).
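
That shrinkage is easy to verify numerically; here is a small sketch of the standard formula for the volume of the d-dimensional unit ball, V_d = pi^(d/2) / Gamma(d/2 + 1):

```python
# Sketch: volume of the d-dimensional unit ball, V_d = pi^(d/2) / Gamma(d/2 + 1).
# Its ratio to the enclosing cube [-1, 1]^d (volume 2^d) collapses towards zero.
from math import pi, gamma

for d in (2, 3, 10, 20, 50):
    v_ball = pi ** (d / 2) / gamma(d / 2 + 1)
    print(f"d={d:3d}  ball volume={v_ball:.3e}  ball/cube ratio={v_ball / 2 ** d:.3e}")
```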

The only difference I found in the Annoy library is that the space is partitioned using hyperplanes rather than by splitting one dimension at a time.
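
In other words, the per-node test changes roughly like this (my own sketch, not Annoy's actual code):

```python
# Sketch: a KD-tree node tests one coordinate; an Annoy-style node tests a hyperplane.
import numpy as np

def kd_goes_left(x, dim, threshold):
    # axis-aligned split: compare a single coordinate against a threshold
    return x[dim] < threshold

def hyperplane_goes_left(x, normal, offset):
    # oblique split: compare the projection onto a random direction against an offset
    return np.dot(normal, x) < offset
```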

Has anyone ever analyzed this algorithm and can explain why it is so much faster than a KD-Tree?

gsamaras
lsxliron
  • I'm no expert on this, but Annoy is using a locality sensitive hash to reduce the dimensionality of the input space. A decision about where to search in the lower dimensional space may miss the actual nearest neighbor(s). That's why it's an approximate method. In all likelihood it has the same time bound as a binary space partition for the resulting lower dimensional space. A BSP's performance is roughly the same as a KD tree. – Gene May 09 '16 at 00:08
  • @Gene Thanks for your answer. I don't think that's the case. Annoy is not LSH (even though it uses random projections). Annoy generates a forest of binary trees where every leaf can have at most `k` elements, while LSH uses hash tables and signatures to find candidates for similar items. Also, according to the source code, it does not use dimensionality reduction techniques. – lsxliron May 09 '16 at 02:42
  • Hmmm... Okay. The "How it works" section of the documentation references LSH directly. Projection _is_ dimensionality reduction. Sorry I couldn't help. – Gene May 09 '16 at 23:55
  • "thus it requires to check many neighboring leafs." why is that ? KD-tree splits data to 2 equal parts on every step. Am I wrong ? – canbax Jun 07 '19 at 08:09

1 Answer


Read this section of Annoy:

How does it work:

Using random projections and by building up a tree. At every intermediate node in the tree, a random hyperplane is chosen, which divides the space into two subspaces. This hyperplane is chosen by sampling two points from the subset and taking the hyperplane equidistant from them.

We do this k times so that we get a forest of trees. k has to be tuned to your need, by looking at what tradeoff you have between precision and performance.
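
A rough sketch of the split rule described above (a simplification of mine; the real library adds many refinements, e.g. a maximum leaf size and better handling of degenerate splits):

```python
# Sketch of the quoted split: sample two points, split by the hyperplane equidistant
# from them, recurse; building several such trees gives the forest.
import numpy as np

rng = np.random.default_rng(0)

def build_tree(points, ids, leaf_size=10):
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    a, b = points[rng.choice(ids, size=2, replace=False)]
    normal = a - b                              # hyperplane normal
    offset = normal @ (a + b) / 2               # hyperplane passes through the midpoint
    left = [i for i in ids if normal @ points[i] < offset]
    right = [i for i in ids if normal @ points[i] >= offset]
    if not left or not right:                   # degenerate split, stop here
        return {"leaf": ids}
    return {"normal": normal, "offset": offset,
            "left": build_tree(points, left, leaf_size),
            "right": build_tree(points, right, leaf_size)}

X = rng.standard_normal((1000, 50))
forest = [build_tree(X, list(range(len(X)))) for _ in range(10)]   # k = 10 trees
```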

The key here is the forest of trees, I suppose. You are comparing against a KD-tree, which is a rather old data structure and, as you said, suffers from the curse of dimensionality.

I think that using a forest of Randomized KD trees would be a good match here.
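
To give a rough idea of what "randomized" means here, one common variant picks each node's split dimension at random among the few highest-variance dimensions, so the trees in the forest differ from one another. A sketch of that idea (my illustration, not kd-GeRaF's exact scheme):

```python
# Sketch: one node split of a randomized KD-tree -- choose the split dimension at
# random among the few highest-variance dimensions, so each tree in the forest differs.
import numpy as np

rng = np.random.default_rng(0)

def randomized_split(points, top=5):
    variances = points.var(axis=0)
    candidates = np.argsort(variances)[-top:]     # dimensions with highest variance
    dim = rng.choice(candidates)                  # each tree makes a different choice
    threshold = np.median(points[:, dim])
    return dim, threshold

X = rng.standard_normal((1000, 100))
print(randomized_split(X))
```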

For example, my kd-GeRaF offers a good explanation of this (see the paper). If, however, the number of dimensions is relatively small, i.e. around 100, then FLANN should be interesting too. FLANN is older than kd-GeRaF, but it inspired me a lot.


I do not see how LSH is used in Annoy, as suggested in the comments, but if it is, then it's no problem for an RKD forest.

gsamaras