
Say I have a huge list of n vectors (a few million). Given a new vector, I need to find one from the set that is pretty close, but it doesn't need to be the closest. (Nearest-neighbor search finds the closest and runs in O(n) time.)

What algorithms are there that can approximate nearest neighbor very quickly at the cost of accuracy?

EDIT: Since it will probably help, I should mention the data are pretty smooth most of the time, with a small chance of spikiness in a random dimension.

Nathan
  • What is the dimension of the vectors? What distance function are you using? –  Feb 17 '11 at 22:24
  • 5 dimensional. I'm just using a generalization of the Pythagorean theorem. – Nathan Feb 17 '11 at 22:54
  • 1
    This survey might be useful: http://www.almaden.ibm.com/u/kclarkson/nn_survey/p.pdf –  Feb 17 '11 at 23:48

4 Answers


There exist algorithms faster than O(n) for finding the closest element under an arbitrary distance. Check http://en.wikipedia.org/wiki/Kd-tree for details.
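As a sketch of how this looks in practice (not part of the original answer): SciPy's `cKDTree` supports approximate queries through its `eps` parameter, which guarantees the returned neighbor is within a factor of `(1 + eps)` of the true nearest distance. The data sizes and seed below are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

# Illustrative data: 100k random 5-D vectors (the question mentions millions).
rng = np.random.default_rng(0)
points = rng.random((100_000, 5))
tree = cKDTree(points)

query = rng.random(5)

# eps > 0 makes the search approximate: the returned distance is
# guaranteed to be no more than (1 + eps) times the true nearest distance,
# and the tree can prune far more aggressively.
dist, idx = tree.query(query, k=1, eps=0.5)

# For comparison only: the exact nearest distance by brute force.
true_dist = np.linalg.norm(points - query, axis=1).min()
```

Larger `eps` trades accuracy for speed; `eps=0` gives the exact nearest neighbor.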

yura
  • +1 I was going to suggest kd-tree. Finds nearest neighbor and can also find the k nearest as well in time O(log(n)+k). – phkahler Feb 18 '11 at 21:05
  • KD Trees are *much* faster if you quit early, after looking at say 1000 of 1m points. See the runtimes for cutoff 1000, 5000 under [nearest-neighbors-in-high-dimensional-data](http://stackoverflow.com/questions/5751114/nearest-neighbors-in-high-dimensional-data), also good links on LSH there. – denis Apr 29 '11 at 09:36

If you are using high-dimensional vectors, like SIFT or SURF or any descriptor used in the multimedia sector, I suggest you consider LSH.

A PhD dissertation from Wei Dong (http://www.cs.princeton.edu/cass/papers/cikm08.pdf) might help you find an up-to-date algorithm for approximate KNN search, i.e., LSH. Unlike more traditional LSH schemes, such as E2LSH (http://www.mit.edu/~andoni/LSH/) published earlier by MIT researchers, his algorithm uses multi-probing to better balance the trade-off between recall rate and cost.

HMK

For approximate nearest neighbor, the fastest way is to use locality-sensitive hashing (LSH). There are many variants of LSH. You should choose one depending on the distance metric of your data. The big-O of the query time for LSH is independent of the dataset size (not counting the time to output results), so it is really fast. This LSH library implements various LSH schemes for L2 (Euclidean) space.
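To make the idea concrete, here is a minimal, illustrative sketch of LSH for L2 distance using random projections quantized into buckets (p-stable hashing). The class name, parameter names, and values are my own for illustration, not from any particular library; a real implementation would use several hash tables and probe neighboring buckets.

```python
import numpy as np

class L2LSH:
    """Minimal LSH sketch for L2 distance: project onto random directions,
    quantize into buckets of fixed width, and use the bucket key as a hash.
    Nearby vectors are likely to land in the same bucket."""

    def __init__(self, data, n_projections=8, bucket_width=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.data = data
        self.proj = rng.normal(size=(data.shape[1], n_projections))
        self.offsets = rng.uniform(0, bucket_width, n_projections)
        self.width = bucket_width
        self.table = {}
        for i, key in enumerate(self._keys(data)):
            self.table.setdefault(key, []).append(i)

    def _keys(self, vectors):
        # Quantized random projections; each row becomes a hashable tuple.
        h = np.floor((vectors @ self.proj + self.offsets) / self.width)
        return [tuple(row) for row in h.astype(int)]

    def query(self, v):
        """Return the index of the closest point in v's bucket, or None if
        the bucket is empty (a full implementation would multi-probe)."""
        candidates = self.table.get(self._keys(v[None, :])[0], [])
        if not candidates:
            return None
        cand = np.asarray(candidates)
        d = np.linalg.norm(self.data[cand] - v, axis=1)
        return int(cand[np.argmin(d)])

# Illustrative usage: index 10k random 5-D vectors and query one of them.
rng = np.random.default_rng(1)
data = rng.random((10_000, 5))
index = L2LSH(data)
result = index.query(data[42])  # a point already in the set finds itself
```

Query cost depends only on the bucket size, not on the dataset size, which is where the speedup over the brute-force O(n) scan comes from.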

Now, if the dimension of your data is less than about 10, a k-d tree is preferred if you want an exact result.

ekzhu

A web search on "nearest neighbor" lsh library finds:
  • http://www.mit.edu/~andoni/LSH/
  • http://www.cs.umd.edu/~mount/ANN/
  • http://msl.cs.uiuc.edu/~yershova/MPNN/MPNN.htm

mcdowella