17

I have a database with 500,000 points in a 100-dimensional space, and I want to find the closest 2 points. How do I do it?

Update: The space is Euclidean, sorry. And thanks for all the answers. BTW, this is not homework.

  • Out of interest, where did you get a 100-dimensional space? – Will A Oct 10 '10 at 05:13
  • The question lacks clarity. Is this a mathematical question? – Sarmaad Oct 10 '10 at 05:14
  • @Sarmaad The question may lack many things, but it does have clarity: after reading one sentence I understand the problem completely. (Although the type of space isn't mentioned, Euclidean is usually assumed by default.) – Nikita Rybak Oct 10 '10 at 05:45
  • Related: http://stackoverflow.com/q/2486093 Beware that the KDTree approach in your case requires ~5 days of computation. – jfs Oct 11 '10 at 00:16
  • What kind of performance are you looking for? – Unreason Oct 11 '10 at 12:50
  • @J.F.Sebastian: I am going for an approximate k-NN solution, using the ANN library suggested by @dalle below. But I plan to reduce the dimensions to 20 using PCA first, as per @Hasan Khan's suggestion. – Edwin Jose Palathinkal Oct 11 '10 at 14:31
  • @Unreason: The performance I am looking for is something that can run within a week or so. – Edwin Jose Palathinkal Oct 11 '10 at 14:32
  • @louzer: Well, if you brute-force it you would need 500,000 * 499,999 / 2 ≈ 1.25e11 comparisons, each taking about 400 floating-point operations, which brings it to 5e13 operations. In one day you have around 86,400 seconds, so you would need 5.8e8 FLOPS, and I think current processors are approximately there. With some cycles spent on other things, I guess you could brute-force it in a week. [See the sketch after this list.] – Unreason Oct 11 '10 at 15:33
  • @louzer: Here's a brute-force approach using KDTree and multiprocessing: http://ideone.com/Z7uSc (you could test it against your solution on a small number of points). – jfs Oct 11 '10 at 18:11
  • @J.F.Sebastian, you should post it as an answer... – Unreason Oct 12 '10 at 07:12
  • @Will, I got a 104-dimensional space when I plotted all proteins according to certain functional & structural parameters. I was able to derive a very small dataset with just the closest evolutionary relatives by solving the k-NN problem. I ran a classifier on those relatives to derive high-precision & high-recall signatures common to all proteins with certain properties. – Edwin Jose Palathinkal Dec 06 '10 at 04:42
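To make Unreason's estimate concrete, here is a minimal vectorized brute-force sketch in Python/NumPy (the chunked loop and the use of scipy's cdist are my illustration, not code from the thread); it finds the exact closest pair if you can wait through the O(n^2) work:

    import numpy as np
    from scipy.spatial.distance import cdist

    def closest_pair_bruteforce(points, chunk=1000):
        """Exact closest pair by blocked brute force over all O(n^2) distances."""
        n = len(points)
        best = np.inf
        best_pair = None
        for i in range(0, n, chunk):
            a = points[i:i + chunk]
            for j in range(i, n, chunk):   # j >= i: each pair of blocks visited once
                b = points[j:j + chunk]
                d = cdist(a, b)            # pairwise Euclidean distance matrix
                if i == j:
                    np.fill_diagonal(d, np.inf)   # ignore self-distances
                k = np.unravel_index(np.argmin(d), d.shape)
                if d[k] < best:
                    best, best_pair = d[k], (i + k[0], j + k[1])
        return best, best_pair

    points = np.random.rand(5000, 100)   # small demo; 500,000 points takes far longer
    print(closest_pair_bruteforce(points))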

5 Answers

17

There's a chapter in Introduction to Algorithms devoted to finding the two closest points in two-dimensional space in O(n log n) time. You can check it out on Google Books. In fact, I suggest it to everyone, as the way they apply the divide-and-conquer technique to this problem is simple, elegant, and impressive.

Although it can't be extended directly to your problem (the constant 7, the number of strip neighbors that must be checked in 2D, would be replaced with 2^101 - 1 in 100 dimensions), it should be just fine for most datasets. So, if you have reasonably random input, it will give you O(n log n * m) complexity, where n is the number of points and m is the number of dimensions.

Edit: That's all assuming you have a Euclidean space, i.e., the length of a vector v is sqrt(v0^2 + v1^2 + v2^2 + ...). If you can choose the metric, however, there could be other options to optimize the algorithm.
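For reference, here is a minimal Python sketch of the 2D version of that algorithm (my own illustration of the book's method, assuming distinct points); the inner strip loop is where the constant 7 lives:

    import math

    def closest_pair(points):
        """Closest pair in 2D in O(n log n) by divide and conquer (after CLRS)."""
        px = sorted(points)                       # sorted by x
        py = sorted(points, key=lambda p: p[1])   # sorted by y

        def dist(p, q):
            return math.hypot(p[0] - q[0], p[1] - q[1])

        def rec(px, py):
            n = len(px)
            if n <= 3:   # base case: brute force
                return min((dist(p, q), p, q)
                           for i, p in enumerate(px) for q in px[i + 1:])
            mid = n // 2
            mid_x = px[mid][0]
            left = set(px[:mid])
            best = min(rec(px[:mid], [p for p in py if p in left]),
                       rec(px[mid:], [p for p in py if p not in left]))
            # Candidates near the dividing line, already in y order.
            strip = [p for p in py if abs(p[0] - mid_x) < best[0]]
            for i, p in enumerate(strip):
                # In 2D, only the next 7 strip points can beat the current best;
                # in 100 dimensions this constant explodes, as noted above.
                for q in strip[i + 1:i + 8]:
                    d = dist(p, q)
                    if d < best[0]:
                        best = (d, p, q)
            return best

        return rec(px, py)

    print(closest_pair([(0, 0), (5, 4), (3, 1), (9, 9), (3.2, 1.1)]))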

Nikita Rybak
  • Divide and conquer only works well when you have many more points than dimensions. You save the most by limiting the cross-search over both halves to the shortest distance found within either half. But here is the problem: in high dimensions, the shortest distance found may be larger than the distance of the farthest point from the chosen border, so nearly every point falls in the strip. In the end you save nearly nothing and still go n^2, with even more overhead. – Lee Jul 21 '23 at 15:20
7

Use a kd-tree. You're looking at a nearest-neighbor problem, and there are highly optimized data structures for handling this exact class of problems.

http://en.wikipedia.org/wiki/Kd-tree
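As a sketch of how that looks in practice (this uses scipy's cKDTree rather than a hand-rolled tree, and queries each point for its two nearest neighbors, the first of which is the point itself), though note the comments elsewhere on how kd-trees degrade in 100 dimensions:

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(10000, 100)   # stand-in for the real database

    tree = cKDTree(points)
    # k=2 because each point's nearest neighbor is itself; column 1 holds the
    # nearest *other* point.
    dists, idx = tree.query(points, k=2)
    i = int(np.argmin(dists[:, 1]))
    print("closest pair:", i, int(idx[i, 1]), "distance:", dists[i, 1])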

P.S. Fun problem!

Stefan Mai
6

You could try the ANN library, but that only gives reliable results up to 20 dimensions.
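ANN itself is a C++ library; as a rough Python stand-in for the approximate-search idea (this is scipy's cKDTree with its eps parameter, not the ANN API):

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(10000, 100)
    tree = cKDTree(points)
    # eps > 0 permits approximate neighbors: the returned distance is
    # guaranteed to be within a factor of (1 + eps) of the true one.
    dists, idx = tree.query(points, k=2, eps=0.5)
    i = int(np.argmin(dists[:, 1]))
    print("approximate closest pair:", i, int(idx[i, 1]))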

dalle
6

Run PCA on your data to reduce the vectors from 100 dimensions to, say, 20. Then build a nearest-neighbor tree (a kd-tree) and get the closest 2 neighbors based on Euclidean distance.

Generally, if the number of dimensions is very large, you have to use either a brute-force approach (parallel + distributed/MapReduce) or a clustering-based approach.
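A sketch of that pipeline (PCA done by hand with NumPy's SVD; the 20 components and the final re-check in the full space are my choices, and distances in the reduced space are only approximate, so the result is a candidate pair rather than a guaranteed exact answer):

    import numpy as np
    from scipy.spatial import cKDTree

    def pca_reduce(points, n_components=20):
        """Project points onto their top principal components via SVD."""
        centered = points - points.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return centered @ vt[:n_components].T

    points = np.random.rand(10000, 100)   # stand-in for the real data
    reduced = pca_reduce(points, 20)

    tree = cKDTree(reduced)
    dists, idx = tree.query(reduced, k=2)   # column 0 is each point itself
    i = int(np.argmin(dists[:, 1]))
    j = int(idx[i, 1])
    # Re-measure the candidate pair in the original 100-dimensional space.
    print("candidate pair:", i, j,
          "true distance:", np.linalg.norm(points[i] - points[j]))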

Muhammad Hasan Khan
4

Use the data structure known as a kd-tree. You'll need to allocate a lot of memory, but you may discover an optimization or two along the way based on your data.

http://en.wikipedia.org/wiki/Kd-tree

My friend was working on his PhD thesis years ago when he encountered a similar problem. His work was on the order of 1M points across 10 dimensions. We built a kd-tree library to solve it. We may be able to dig up the code if you want to contact us offline.

Here's his published paper: http://www.elec.qmul.ac.uk/people/josh/documents/ReissSelbieSandler-WIAMIS2003.pdf

selbie
  • kd-trees make it easy to find the nearest neighbor of a given point in O(log n) time, as I remember. Is there an optimization to find the closest pair of points in less than O(n log n)? – rampion Oct 10 '10 at 14:46
  • -1; also, according to Wikipedia, kd-trees are only efficient if N >> 2^k (where k is the number of dimensions and N the number of points); here 2^100 >> 5e5, so that condition fails and the answer is completely misleading. – Unreason Oct 12 '10 at 13:54
  • 10d is not 100d. Even if the data points lie roughly in a 10-d plane within 100d, a kd-tree can't work (imho): think of a kd-tree 100 levels deep. – denis Feb 01 '11 at 14:59