3

I have around 10 K points in 5 dimensional space. We can assume that the points are randomly distributed in space (0,0,0,0,0) and (100,100,100,100,100). Clearly, the whole data set can easily reside in memory.

I would like to know which algorithm for k nearest neighbour would run faster, kd-tree or RTree.

Although I have some very high level idea of these two algorithms, I am not sure which will run faster, and why. I am open to exploring other algorithms if any, which could run fast. Please, if possible, specify why an algorithm may run faster.

Ouroboros
  • 1,432
  • 1
  • 19
  • 41
  • possible duplicate of [Could anyone tell me what's the difference between KD-tree and R-tree](http://stackoverflow.com/questions/4326332/could-anyone-tell-me-whats-the-difference-between-kd-tree-and-r-tree) – Has QUIT--Anony-Mousse Dec 16 '13 at 18:41
  • 2
    To the down voters. I think this question is well posed. When it comes to acceleration structures, their theoretical logic and real performance can often be very disparate. In fact the impetus for R tree was Guttmann's need for an acceleration structure that performs well on disks with slow random fetches. This concept can be extended to caches as well. – Vectorized Jul 07 '16 at 00:42

1 Answers1

5

This depends on various parameters. Most importantly on your capability to implement these algorithms.

I've personally found bulk-loaded R*-trees to be faster for large data, probably because they have a better fan-out. Bulk-loaded R-trees is a more fair comparison, as kd-trees are commonly bulk-loaded (in fact, they don't support incremental operation very well at all).

For tiny data, kd-trees will likely be faster, plus they are much simpler to implement.

For other things, please refer to this earlier question / answer:

https://stackoverflow.com/a/11109467/1060350

Community
  • 1
  • 1
Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194