17

I have a database with 500,000 points in a 100-dimensional space, and I want to find the closest 2 points. How do I do it?

Update: The space is Euclidean, sorry. And thanks for all the answers. BTW, this is not homework.

  • Out of interest, where did you get a 100-dimensional space? – Will A Oct 10 '10 at 05:13
  • The question lacks clarity. Is this a mathematical question? – Sarmaad Oct 10 '10 at 05:14
  • @Sarmaad The question may lack many things, but it does have clarity: after reading one sentence I understand the problem completely. (Although the type of space isn't mentioned, Euclidean is usually assumed by default.) – Nikita Rybak Oct 10 '10 at 05:45
  • Related: http://stackoverflow.com/q/2486093 Beware that the KDTree approach in your case requires ~5 days of computation. – jfs Oct 11 '10 at 00:16
  • What kind of performance are you looking for? – Unreason Oct 11 '10 at 12:50
  • @J.F.Sebastian: I am going for an approximate k-NN solution, using the ANN library suggested by @dalle below. But I plan to reduce the dimensions to 20 using PCA first, as per @Hasan Khan's suggestion. – Edwin Jose Palathinkal Oct 11 '10 at 14:31
  • @Unreason: The performance I am looking for is something that can run within a week or so. – Edwin Jose Palathinkal Oct 11 '10 at 14:32
  • @louzer: Well, if you brute-force it you would need 500,000 * 499,999 / 2 ≈ 1.25e11 comparisons, each taking about 400 floating-point operations, which brings it to 5e13 operations. In one day you have around 86,400 seconds, so you would need 5.8e8 FLOPS, and I think current processors are approximately there. With some cycles spent on other things, I guess you could brute-force it in a week. [See the sketch after this list.] – Unreason Oct 11 '10 at 15:33
  • @louzer: Here's a brute-force approach using KDTree and multiprocessing: http://ideone.com/Z7uSc (you could test it against your solution on a small number of points). – jfs Oct 11 '10 at 18:11
  • @J.F.Sebastian, you should post it as an answer... – Unreason Oct 12 '10 at 07:12
  • @Will, I got a 104-dimensional space when I plotted all proteins according to certain functional & structural parameters. I was able to derive a very small dataset with just the closest evolutionary relatives by solving the k-NN problem. I ran a classifier on those relatives to derive high-precision & high-recall signatures common to all proteins with certain properties. – Edwin Jose Palathinkal Dec 06 '10 at 04:42
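To make Unreason's estimate concrete, here is a minimal vectorized brute-force sketch in Python/NumPy (the chunked loop and the use of scipy's cdist are my illustration, not code from the thread); it finds the exact closest pair if you can wait through the O(n^2) work:

    import numpy as np
    from scipy.spatial.distance import cdist

    def closest_pair_bruteforce(points, chunk=1000):
        """Exact closest pair by blocked brute force over all O(n^2) distances."""
        n = len(points)
        best = np.inf
        best_pair = None
        for i in range(0, n, chunk):
            a = points[i:i + chunk]
            for j in range(i, n, chunk):   # j >= i: each pair of blocks visited once
                b = points[j:j + chunk]
                d = cdist(a, b)            # pairwise Euclidean distance matrix
                if i == j:
                    np.fill_diagonal(d, np.inf)   # ignore self-distances
                k = np.unravel_index(np.argmin(d), d.shape)
                if d[k] < best:
                    best, best_pair = d[k], (i + k[0], j + k[1])
        return best, best_pair

    points = np.random.rand(5000, 100)   # small demo; 500,000 points takes far longer
    print(closest_pair_bruteforce(points))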

5 Answers

17

There's a chapter in Introduction to Algorithms devoted to finding the two closest points in two-dimensional space in O(n log n) time. You can check it out on Google Books. In fact, I suggest it to everyone, as the way they apply the divide-and-conquer technique to this problem is simple, elegant, and impressive.

Although it can't be extended directly to your problem (the constant 7, the number of strip neighbors that must be checked in 2D, would be replaced with 2^101 - 1 in 100 dimensions), it should be just fine for most datasets. So, if you have reasonably random input, it will give you O(n log n * m) complexity, where n is the number of points and m is the number of dimensions.

Edit: That's all assuming you have a Euclidean space, i.e., the length of a vector v is sqrt(v0^2 + v1^2 + v2^2 + ...). If you can choose the metric, however, there could be other options to optimize the algorithm.
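For reference, here is a minimal Python sketch of the 2D version of that algorithm (my own illustration of the book's method, assuming distinct points); the inner strip loop is where the constant 7 lives:

    import math

    def closest_pair(points):
        """Closest pair in 2D in O(n log n) by divide and conquer (after CLRS)."""
        px = sorted(points)                       # sorted by x
        py = sorted(points, key=lambda p: p[1])   # sorted by y

        def dist(p, q):
            return math.hypot(p[0] - q[0], p[1] - q[1])

        def rec(px, py):
            n = len(px)
            if n <= 3:   # base case: brute force
                return min((dist(p, q), p, q)
                           for i, p in enumerate(px) for q in px[i + 1:])
            mid = n // 2
            mid_x = px[mid][0]
            left = set(px[:mid])
            best = min(rec(px[:mid], [p for p in py if p in left]),
                       rec(px[mid:], [p for p in py if p not in left]))
            # Candidates near the dividing line, already in y order.
            strip = [p for p in py if abs(p[0] - mid_x) < best[0]]
            for i, p in enumerate(strip):
                # In 2D, only the next 7 strip points can beat the current best;
                # in 100 dimensions this constant explodes, as noted above.
                for q in strip[i + 1:i + 8]:
                    d = dist(p, q)
                    if d < best[0]:
                        best = (d, p, q)
            return best

        return rec(px, py)

    print(closest_pair([(0, 0), (5, 4), (3, 1), (9, 9), (3.2, 1.1)]))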

Nikita Rybak
  • Divide and conquer only works well when you have many more points than dimensions. You save the most by limiting the cross-search over both halves to the shortest distance found within either half. But here is the problem: in high dimensions, the shortest distance found may be larger than the distance of the farthest point from the chosen border, so nearly every point falls in the strip. In the end you save nearly nothing and still go n^2, with even more overhead. – Lee Jul 21 '23 at 15:20
7

Use a kd-tree. You're looking at a nearest-neighbor problem, and there are highly optimized data structures for handling this exact class of problems.

http://en.wikipedia.org/wiki/Kd-tree
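As a sketch of how that looks in practice (this uses scipy's cKDTree rather than a hand-rolled tree, and queries each point for its two nearest neighbors, the first of which is the point itself), though note the comments elsewhere on how kd-trees degrade in 100 dimensions:

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(10000, 100)   # stand-in for the real database

    tree = cKDTree(points)
    # k=2 because each point's nearest neighbor is itself; column 1 holds the
    # nearest *other* point.
    dists, idx = tree.query(points, k=2)
    i = int(np.argmin(dists[:, 1]))
    print("closest pair:", i, int(idx[i, 1]), "distance:", dists[i, 1])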

P.S. Fun problem!

Stefan Mai
6

You could try the ANN library, but that only gives reliable results up to 20 dimensions.
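ANN itself is a C++ library; as a rough Python stand-in for the approximate-search idea (this is scipy's cKDTree with its eps parameter, not the ANN API):

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(10000, 100)
    tree = cKDTree(points)
    # eps > 0 permits approximate neighbors: the returned distance is
    # guaranteed to be within a factor of (1 + eps) of the true one.
    dists, idx = tree.query(points, k=2, eps=0.5)
    i = int(np.argmin(dists[:, 1]))
    print("approximate closest pair:", i, int(idx[i, 1]))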

dalle
6

Run PCA on your data to reduce the vectors from 100 dimensions to, say, 20. Then build a nearest-neighbor tree (a kd-tree) and get the closest 2 neighbors based on Euclidean distance.

Generally, if the number of dimensions is very large, you have to use either a brute-force approach (parallel + distributed/MapReduce) or a clustering-based approach.
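A sketch of that pipeline (PCA done by hand with NumPy's SVD; the 20 components and the final re-check in the full space are my choices, and distances in the reduced space are only approximate, so the result is a candidate pair rather than a guaranteed exact answer):

    import numpy as np
    from scipy.spatial import cKDTree

    def pca_reduce(points, n_components=20):
        """Project points onto their top principal components via SVD."""
        centered = points - points.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return centered @ vt[:n_components].T

    points = np.random.rand(10000, 100)   # stand-in for the real data
    reduced = pca_reduce(points, 20)

    tree = cKDTree(reduced)
    dists, idx = tree.query(reduced, k=2)   # column 0 is each point itself
    i = int(np.argmin(dists[:, 1]))
    j = int(idx[i, 1])
    # Re-measure the candidate pair in the original 100-dimensional space.
    print("candidate pair:", i, j,
          "true distance:", np.linalg.norm(points[i] - points[j]))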

Muhammad Hasan Khan
4

Use the data structure known as a kd-tree. You'll need to allocate a lot of memory, but you may discover an optimization or two along the way based on your data.

http://en.wikipedia.org/wiki/Kd-tree

My friend was working on his PhD thesis years ago when he encountered a similar problem. His work was on the order of 1M points across 10 dimensions. We built a kd-tree library to solve it. We may be able to dig up the code if you want to contact us offline.

Here's his published paper: http://www.elec.qmul.ac.uk/people/josh/documents/ReissSelbieSandler-WIAMIS2003.pdf

selbie
  • kd-trees make it easy to find the nearest neighbor of a given point in O(log n) time, as I remember. Is there an optimization to find the closest pair of points in less than O(n log n)? – rampion Oct 10 '10 at 14:46
  • -1; also, according to Wikipedia, kd-trees are only efficient if N >> 2^k (where k is the number of dimensions and N the number of points); here 2^100 >> 5e5, so that condition fails and the answer is completely misleading. – Unreason Oct 12 '10 at 13:54
  • 10d is not 100d. Even if the data points lie roughly in a 10-d plane within 100d, a kd-tree can't work (imho): think of a kd-tree 100 levels deep. – denis Feb 01 '11 at 14:59