
I wrote a basic O(n^2) algorithm for a nearest neighbor search. As usual, MATLAB 2013a's knnsearch(..) method works a lot faster.
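
For context, my implementation is essentially a brute-force double loop along these lines (a simplified sketch, not my exact code; `X` are the reference points and `Y` the query points):

```matlab
% Naive O(n^2) nearest neighbor search: for every query point,
% compare against every reference point and keep the closest one.
function idx = naive_nnsearch(X, Y)
    m = size(Y, 1);
    idx = zeros(m, 1);
    for i = 1:m
        best = inf;
        for j = 1:size(X, 1)
            d = sum((Y(i,:) - X(j,:)).^2);   % squared Euclidean distance
            if d < best
                best = d;
                idx(i) = j;
            end
        end
    end
end
```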

Can someone tell me what kind of optimization they used in their implementation?

I am okay with reading any documentation or paper that you may point me to.

PS: I understand that the documentation on the site mentions a paper on kd-trees as a reference. But as far as I understand, kd-trees are the default option only when the number of columns is less than 10. Mine has 21 columns. Correct me if I'm wrong about this.

Roy
  • If you want to see where the difference is made, try running `profile` on your code as well as on the MATLAB code, and see which parts are significantly different. – Dennis Jaheruddin Apr 24 '14 at 09:22

2 Answers


The biggest optimization MathWorks have made in implementing nearest-neighbor search is that all the hard stuff is implemented in a MEX file, as compiled C, rather than in MATLAB.

With an algorithm such as kNN that (in my limited understanding) is quite recursive and difficult to vectorize, that's likely to give such an improvement that the O() analysis will only be relevant at pretty high n.

In more detail, under the hood the knnsearch command uses createns to create a NeighborSearcher object. By default, when X has 10 or fewer columns, this will be a KDTreeSearcher object, and when X has more than 10 columns it will be an ExhaustiveSearcher object (both KDTreeSearcher and ExhaustiveSearcher are subclasses of NeighborSearcher).
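
You can check this for yourself with a quick sketch (assuming you have the Statistics Toolbox; if I recall the defaults correctly, the cut-off is at 10 columns):

```matlab
% createns picks the searcher type based on the number of columns of X
ns21 = createns(rand(1000, 21));   % 21 columns, as in the question
ns5  = createns(rand(1000, 5));    % well under the cut-off

class(ns21)   % ExhaustiveSearcher
class(ns5)    % KDTreeSearcher
```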

All objects of class NeighborSearcher have a method knnsearch (which you would rarely call directly, instead using the convenience command knnsearch rather than the method). The knnsearch method of KDTreeSearcher calls straight out to a MEX file for all the hard work. This lives in matlabroot\toolbox\stats\stats\@KDTreeSearcher\private\knnsearchmex.mexw64.
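
As a rough sketch of what that looks like in practice (the exact MEX file name and extension will vary with platform and release):

```matlab
X = rand(1000, 3);
Y = rand(10, 3);

ns  = KDTreeSearcher(X);             % or createns(X, 'NSMethod', 'kdtree')
idx = knnsearch(ns, Y, 'K', 5);      % the searcher object's knnsearch method

% Look for the MEX file that does the actual work
dir(fullfile(matlabroot, 'toolbox', 'stats', 'stats', ...
             '@KDTreeSearcher', 'private', 'knnsearchmex.*'))
```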

As far as I know, this MEX file implements pretty much the algorithm described in the paper by Friedman, Bentley, and Finkel referenced in the documentation page, with no structural changes. As the title of the paper suggests, that algorithm is O(log(n)) rather than O(n^2). Unfortunately, the contents of the MEX file are not available for inspection to confirm that.

Sam Roberts
  • Actually kNN is straightforward to implement fully vectorized in a few lines of code (see [here for an example](http://stackoverflow.com/a/10865856/97160)). Basically we compute the distance matrix between the query instance(s) and all training points (exhaustive search), sort by the distances, take the closest `K` points, then apply majority vote to determine the class label. The hard work is done by `pdist2`, which can be [vectorized into a single `bsxfun` call](http://stackoverflow.com/a/7774323/97160) (although not recommended for large datasets as you could easily run out of memory). A short sketch of this follows these comments. – Amro Apr 25 '14 at 21:34
  • @Amro You're talking about just the exhaustive search algorithm for kNN here, correct? - yes, that's clearly vectorizable. But was I wrong to claim that the KDTree algorithm is fairly recursive and difficult to vectorize? As I said, my understanding of that is limited. (Actually I've just realized that my second paragraph refers to kNN in general rather than just KDTrees, which is what I meant). – Sam Roberts Apr 26 '14 at 11:35
  • right, I was referring to the naive kNN algorithm (exhaustive search). Building and searching the KD-tree is an iterative algorithm and not easily vectorizable (if at all). I guess it was a bit unclear which one you were talking about in the 2nd paragraph... For anyone interested, here is [a page](http://www.cs.cmu.edu/~awm/animations/kdtree/) with a short tutorial on kd-trees and some nice animations. Btw the IPT toolbox also has a private implementation of kd-trees in `$toolboxdir\images\images\private\kdtree.m` with a corresponding MEX-function `nnsearch` to perform the search – Amro Apr 26 '14 at 13:27
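
For completeness, here is a minimal vectorized sketch of the exhaustive kNN classification described in the comment above (`Xtrain`, `labels`, `Xtest`, and `K` are placeholder names; `labels` is assumed to be a numeric column vector):

```matlab
% Exhaustive kNN: all pairwise distances, sort, take the K nearest, majority vote
D = pdist2(Xtest, Xtrain);           % m-by-n distance matrix
[~, ord] = sort(D, 2);               % sort each row by increasing distance
nearest  = ord(:, 1:K);              % indices of the K nearest training points
pred     = mode(labels(nearest), 2); % majority vote over the neighbors' labels
```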

The code builds a KD-tree space-partitioning structure to speed up nearest neighbor search; think of it like the indexes commonly used in an RDBMS to speed up lookup operations.

In addition to nearest neighbor(s) searches, this structure also speeds up range searches, which find all points within a distance r of a query point.
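
To make the index analogy concrete, here is a small sketch that builds the tree once and reuses it for both kinds of query:

```matlab
X  = rand(10000, 3);                      % reference points
ns = KDTreeSearcher(X);                   % build the "index" once

Y = rand(5, 3);                           % query points
[idx, dst]   = knnsearch(ns, Y, 'K', 3);  % 3 nearest neighbors of each query
[ridx, rdst] = rangesearch(ns, Y, 0.1);   % all neighbors within radius 0.1
```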

As pointed out by @SamRoberts, the core of the code is implemented in C/C++ as a MEX-function.

Note that knnsearch chooses to build a KD-tree only under certain conditions, and falls back to an exhaustive search otherwise (by naively searching all points for the nearest one).

Keep in mind that in cases of very high-dimensional data (and few instances), the algorithm degenerates and is no better than an exhaustive search. In general, once you go beyond about d > 30 dimensions, the cost of searching a KD-tree grows until you are effectively visiting almost all of the points, and it can even become worse than a brute-force search because of the overhead involved in building the tree.
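
You can see that trade-off directly by forcing each method and timing it (a rough sketch; the exact numbers depend on the data, the machine, and whether the cost of building the tree is amortized over many queries):

```matlab
n = 50000;
for d = [3 21 50]
    X = rand(n, d);  Y = rand(1000, d);
    tic; knnsearch(X, Y, 'NSMethod', 'kdtree');     tKD = toc;
    tic; knnsearch(X, Y, 'NSMethod', 'exhaustive'); tEX = toc;
    fprintf('d = %2d: kd-tree %.3f s, exhaustive %.3f s\n', d, tKD, tEX);
end
```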

There are other variations of the algorithm that deal with high dimensionality, such as ball trees, which partition the data into a series of nested hyper-spheres (as opposed to partitioning the data along the Cartesian axes as KD-trees do). Unfortunately, those are not implemented in the official Statistics Toolbox. If you are interested, here is a paper which presents a survey of available kNN algorithms.

[Figure: illustration of searching a kd-tree-partitioned 2D space, borrowed from the MATLAB documentation]

Amro