
I read this question about finding the nearest neighbor for 3-dimensional points. An octree is a solution for this case.

A kd-tree is a solution for low-dimensional spaces (generally fewer than 50 dimensions).
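For example, in low dimensions an exact query is cheap (a minimal sketch using SciPy's cKDTree; the data here is just random points for illustration):

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(10000, 3)            # 3-dimensional points
tree = cKDTree(points)                       # exact kd-tree index
dist, idx = tree.query([0.5, 0.5, 0.5], k=1) # exact 1-NN: distance and index
```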

For higher dimensions (vectors of hundreds of dimensions and millions of points), LSH is a popular solution for the AKNN (Approximate K-NN) problem, as pointed out in this question.

However, LSH is popular for K-NN solutions where K >> 1. For example, LSH has been used successfully for Content-Based Image Retrieval (CBIR) applications, where each image is represented by a vector of hundreds of dimensions and the dataset contains millions (or billions) of images. In this case, K is the number of top-K images most similar to the query image.

But what if we are interested only in the approximate nearest neighbor (i.e. A1-NN) in high-dimensional spaces? Is LSH still the winner, or have ad-hoc solutions been proposed?
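For concreteness, here is a minimal sketch of the kind of LSH index I mean (random-hyperplane hashing for cosine similarity; the class name and parameters are illustrative, not taken from any specific library):

```python
import numpy as np

class HyperplaneLSH:
    """Minimal random-hyperplane LSH index for cosine similarity (sketch)."""

    def __init__(self, dim, n_tables=8, n_bits=16, seed=0):
        rng = np.random.default_rng(seed)
        # One independent set of random hyperplanes per hash table.
        self.planes = rng.standard_normal((n_tables, n_bits, dim))
        self.tables = [dict() for _ in range(n_tables)]
        self.data = None

    def _keys(self, x):
        # Sign pattern of x against each table's hyperplanes -> one bucket key per table.
        bits = np.einsum('tbd,d->tb', self.planes, x) > 0
        return [tuple(row) for row in bits]

    def fit(self, data):
        self.data = np.asarray(data, dtype=float)
        for i, x in enumerate(self.data):
            for t, key in enumerate(self._keys(x)):
                self.tables[t].setdefault(key, []).append(i)

    def query_1nn(self, q):
        # Gather candidates colliding with q in any table, then rank them exactly.
        q = np.asarray(q, dtype=float)
        cand = set()
        for t, key in enumerate(self._keys(q)):
            cand.update(self.tables[t].get(key, ()))
        if not cand:
            return None  # no collision: in practice, add tables or shorten keys
        cand = np.fromiter(cand, dtype=int)
        d = np.linalg.norm(self.data[cand] - q, axis=1)
        return int(cand[np.argmin(d)])

# Usage: index 10k 128-dimensional vectors, then query one point.
rng = np.random.default_rng(1)
data = rng.standard_normal((10000, 128))
index = HyperplaneLSH(dim=128)
index.fit(data)
print(index.query_1nn(rng.standard_normal(128)))
```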

– justHelloWorld

1 Answer


You might look at http://papers.nips.cc/paper/2666-an-investigation-of-practical-approximate-nearest-neighbor-algorithms.pdf and http://research.microsoft.com/en-us/um/people/jingdw/pubs%5CTPAMI-TPTree.pdf. Both have figures and graphs showing the performance of LSH vs. the performance of tree-based methods that also produce only approximate answers, for different values of k, including k=1. The Microsoft paper claims that "It has been shown in [34] that randomized KD trees can outperform the LSH algorithm by about an order of magnitude". Table 2 on page 7 of the other paper appears to show speedups over LSH that are reasonably consistent for different values of k.

Note that this is not LSH vs. kd-trees. It is LSH vs. various cleverly tuned approximate search tree structures: you typically search only the most promising parts of each tree, rather than every part that could possibly contain the closest point, and you search a number of different trees to get a decent probability of finding good points despite this, tuning various parameters for the fastest possible performance. A rough sketch of the idea follows.
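This is my own illustrative code, not the exact algorithm from either paper: each tree splits on a dimension chosen at random from among the highest-variance ones, the query descends greedily to a single leaf with no backtracking, and several independent randomized trees compensate for the regions each tree misses.

```python
import numpy as np

def build_tree(data, idx, rng, leaf_size=16, top_dims=5):
    """Randomized kd-tree: split on a random pick among high-variance dims."""
    if len(idx) <= leaf_size:
        return ('leaf', idx)
    var = data[idx].var(axis=0)
    dim = rng.choice(np.argsort(var)[-top_dims:])  # random high-variance dim
    vals = data[idx, dim]
    thr = np.median(vals)
    left, right = idx[vals <= thr], idx[vals > thr]
    if len(left) == 0 or len(right) == 0:          # degenerate split
        return ('leaf', idx)
    return ('node', dim, thr,
            build_tree(data, left, rng, leaf_size, top_dims),
            build_tree(data, right, rng, leaf_size, top_dims))

def descend(tree, q):
    """Greedy descent to a single leaf: only the 'most promising' path."""
    while tree[0] == 'node':
        _, dim, thr, lo, hi = tree
        tree = lo if q[dim] <= thr else hi
    return tree[1]

def approx_1nn(data, q, n_trees=10, seed=0):
    """Approximate 1-NN: several randomized trees, no backtracking."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(data))
    best_i, best_d = -1, np.inf
    for _ in range(n_trees):
        # A real implementation builds the trees once, not per query.
        leaf = descend(build_tree(data, idx, rng), q)
        d = np.linalg.norm(data[leaf] - q, axis=1)
        j = int(np.argmin(d))
        if d[j] < best_d:
            best_i, best_d = int(leaf[j]), d[j]
    return best_i, best_d
```

The parameters here (leaf size, number of trees, how many top-variance dimensions to randomize over) are exactly the sort of knobs that get tuned for the fastest possible performance on a given dataset.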

– mcdowella
  • This is unexpected. At this point I feel a little bit lost: LSH was designed to outperform kd-trees (and derived structures) on high-dimensional data (like the 128-dimensional SIFT descriptors described in the 2nd paper). And now I find out that it's inefficient? What's the point of using LSH then? In addition, the cited paper ([34] in your answer) uses E2LSH (which is outdated compared to FALCONN, released in 2015) and finds the K-NN by running R-near-neighbor queries with increasing R. I think this is not an accurate process. Of course I might be wrong, I'm just asking for an opinion :) – justHelloWorld Sep 27 '16 at 08:43
  • I am not up to date on the relative speeds of the best implementations of the various options. One paper https://lear.inrialpes.fr/~douze/enseignement/2014-2015/presentation_papers/muja_flann.pdf seems to be about automatically picking and tuning the best option for a particular problem, so presumably there is no clear winner. It is entirely consistent for LSH to outperform kd-trees (which produce exact answers) only to be outperformed itself by a slightly different tree algorithm that produces only approximate answers. Fortunately, very few applications absolutely must have the very fastest option. – mcdowella Sep 27 '16 at 18:15