
I'm actually working on high-dimensional data (~50,000-100,000 features) on which nearest neighbors search must be performed. I know that KD-trees have poor performance as the number of dimensions grows, and I've also read that, in general, all space-partitioning data structures tend to degenerate to exhaustive search on high-dimensional data.

Additionally, there are two important facts to be considered (ordered by relevance):

  • Precision: The nearest neighbors must be found (not approximations).
  • Speed: The search must be as fast as possible. (The time to create the data structure isn't really important).

So, I need some advice about:

  1. The data structure to perform k-NN.
  2. Whether it would be better to use an ANN (approximate nearest neighbor) approach, tuned to be as accurate as possible.

2 Answers


Can I perform NN search in high dimensional space?

No. Because of the curse of dimensionality, data structures that perform nearest neighbour search well in low dimensions fail to perform well in high-dimensional spaces. As a matter of fact, the query time becomes almost equal to that of brute force, which makes them worthless.
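For reference, exact (non-approximate) k-NN in this regime effectively amounts to a linear scan over the data. A minimal brute-force sketch in plain C++ (no external libraries; the names are my own, purely illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Brute-force exact k-NN: scan all points and keep the k smallest squared
// Euclidean distances. Cost is O(n * d) per query; this is roughly what
// space-partitioning structures degrade to in very high dimensions.
std::vector<std::size_t> knn_brute_force(const std::vector<std::vector<float>>& data,
                                         const std::vector<float>& query,
                                         std::size_t k) {
    std::vector<std::pair<float, std::size_t>> dist_idx;
    dist_idx.reserve(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        float d2 = 0.0f;
        for (std::size_t j = 0; j < query.size(); ++j) {
            const float diff = data[i][j] - query[j];
            d2 += diff * diff;
        }
        dist_idx.emplace_back(d2, i);
    }
    // Only the k nearest need to be ordered.
    k = std::min(k, dist_idx.size());
    std::partial_sort(dist_idx.begin(), dist_idx.begin() + k, dist_idx.end());
    std::vector<std::size_t> result(k);
    for (std::size_t i = 0; i < k; ++i) result[i] = dist_idx[i].second;
    return result;
}
```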

As a result, in a high dimensional space, one should go for Approximate Nearest Neighbour (ANN) search. To be honest, it's a must.

Which data structure to perform ANN?

I would suggest LSH, or a number of RKD trees (randomized kd-trees). In my answer here, I mention some good libraries that perform ANN in C++. However, notice that LSH solves the R-near neighbour problem, so you specify the parameter R, which is actually the radius. LSH will then look for NNs inside that radius around the query point, so you can't really request k NNs.

On the other hand, RKD trees can do that and return k NNs. I have a project that builds a forest of RKD trees and performs ANN search in C++, but it targets only high dimensions. It can handle GIST datasets of 10^6 images in 960 dimensions in < 1 sec, with about 90% of the outputs being true nearest neighbours. The name is kd-GeRaF. It will be updated next month with a distributed version, but it's already tested and ready to use. It also has a cute logo. :)
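If you want an off-the-shelf starting point for the randomized kd-tree forest idea, here is a minimal sketch using FLANN (used purely as an illustrative C++ ANN library here, not kd-GeRaF's API; the tree count and search checks are placeholder values you would tune for your data):

```cpp
#include <flann/flann.hpp>
#include <vector>

// Sketch: approximate k-NN with a forest of randomized kd-trees via FLANN.
// 'data' is a row-major array of n points with dim features each.
void ann_example(float* data, int n, int dim, float* query_point, int k) {
    flann::Matrix<float> dataset(data, n, dim);

    // Build 8 randomized kd-trees (more trees -> better recall, slower build).
    flann::Index<flann::L2<float>> index(dataset, flann::KDTreeIndexParams(8));
    index.buildIndex();

    flann::Matrix<float> query(query_point, 1, dim);
    std::vector<int> idx(k);
    std::vector<float> dist(k);
    flann::Matrix<int> indices(idx.data(), 1, k);
    flann::Matrix<float> dists(dist.data(), 1, k);

    // 'checks' controls how many leaves are visited: higher = more accurate, slower.
    index.knnSearch(query, indices, dists, k, flann::SearchParams(512));
}
```

Since build time isn't a concern for you, you can afford more trees and a higher number of checks to push the accuracy up.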


I also feel that you should read my answer, which says that the optimal data structure depends on the data.

  • Do you have to re-construct the tree after each insertion? That's the problem with kd-trees that I want to avoid in my work right now. Tks – EyeQ Tech Jun 17 '17 at 07:28
  • No @TaxiNoiBaiHaNoi. However, you need to be cautious about how you insert data, because it may unbalance the tree (given that it is already balanced). Read more in [Wikipedia](https://en.wikipedia.org/wiki/K-d_tree#Adding_elements). – gsamaras Jun 17 '17 at 11:42

I don't think it would be wise to conduct clustering on such high-dimensional data. There is the curse-of-dimensionality problem.

The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges.
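A quick way to see this empirically is to measure the ratio between the farthest and nearest distance from a query point to a set of random points as the dimension grows; the ratio approaches 1. A small self-contained C++ sketch (illustrative only):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Empirically illustrates distance concentration: as the dimension d grows,
// the max/min distance ratio from a query point to random points approaches 1.
int main() {
    std::mt19937 gen(42);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    const int n = 1000;                     // number of random data points
    for (int d : {10, 100, 1000, 10000}) {  // dimensions to test
        std::vector<double> q(d);
        for (auto& x : q) x = uni(gen);     // random query point
        double dmin = 1e300, dmax = 0.0;
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int j = 0; j < d; ++j) {
                const double diff = uni(gen) - q[j];  // random point, coordinate j
                s += diff * diff;
            }
            const double dist = std::sqrt(s);
            dmin = std::min(dmin, dist);
            dmax = std::max(dmax, dist);
        }
        std::printf("d=%5d  max/min distance ratio = %.3f\n", d, dmax / dmin);
    }
}
```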

I suggest you find a good measure of distance, rather than using Euclidean distance directly in the high-dimensional space.
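For instance, cosine distance is a common alternative for high-dimensional feature vectors; a plain C++ sketch (whether it suits your data is something you would have to verify):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine distance: 1 - cos(angle between a and b). Often more informative than
// raw Euclidean distance for very high-dimensional, sparse feature vectors.
float cosine_distance(const std::vector<float>& a, const std::vector<float>& b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return 1.0f - dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12f);
}
```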

Some possible solutions are listed on this page: https://en.wikipedia.org/wiki/Clustering_high-dimensional_data

  • Subspace clustering
  • Projected clustering
  • Hybrid approaches
  • Correlation clustering

  • I'm not doing and don't want to do clustering. I'm implementing an object recognition system, and there is just one sample per class, so the best approach in this case is nearest neighbor search. – mavillan Aug 22 '15 at 05:13
  • I think what you are looking for is one-shot learning: https://en.wikipedia.org/wiki/One-shot_learning. You could also use a deep learning algorithm to reduce the dimensionality. – William Aug 22 '15 at 12:44