15

I have an M×N array, where M is the number of observations and N is the dimensionality of each vector. From this array of vectors, I need to calculate the mean and minimum Euclidean distance between the vectors.

In my mind, this requires me to calculate MC2 (M choose 2) distances, which is an O(n·min(k, n-k)) algorithm. My M is ~10,000 and my N is ~1,000, and this computation takes ~45 seconds.

Is there a more efficient way to compute the mean and min distances? Perhaps a probabilistic method? I don't need it to be exact, just close.
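For reference, the brute-force computation described above might look like the following. This is an illustrative sketch with placeholder data, not the asker's actual code (which was never posted):

```python
import numpy as np
from scipy.spatial.distance import pdist

M, N = 10_000, 1_000
X = np.random.rand(M, N)            # placeholder for the real observations

# All M*(M-1)/2 pairwise Euclidean distances in a flat array.
d = pdist(X, metric='euclidean')
mean_dist = d.mean()
min_dist = d.min()
```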

jonrsharpe
japata
  • 1
    http://stackoverflow.com/questions/12108181/calculate-the-maximum-distance-between-vectors-in-an-array – Mitch Wheat Mar 19 '17 at 03:36
  • 1
    Can you post your current code? In my head, I'm only seeing O(m^2*n), perhaps I'm misunderstanding something. – pgreen2 Mar 19 '17 at 03:37
  • Interesting question. However, I'm not sure where you got the variables C_2 and k from. As pgreen2 mentioned, I see an O(n*m^2) algorithm as the most straightforward approach. – Filip Kilibarda Mar 19 '17 at 17:16
  • Full disclosure, I'm a novice at algorithms, so it's probable that I'm wrong in my analysis. However, I can't see a way around computing the distance between every possible pair of vectors, which results in `M` choose `2` computations. From Mitch's link, I can see how the brute force method would be O(k*n^2). I can also understand how to efficiently compute the `maximum` distance of the set, but not the `minimum` or the `average`. Thanks for the quick responses :) – japata Mar 19 '17 at 22:43
  • If I normalized all of the vectors to be unit length, would it be less computationally expensive to compute these metrics? I'm trying to compute a measure of the dispersion of these vectors on the hyper-sphere around the origin, with the maximum value being where the minimum distance from a point P to the other points in the set is the same for every point, e.g. when the points are uniformly distributed on the sphere. – japata Mar 19 '17 at 23:24
  • Unit length wouldn't speed it up. There may actually be a way to do this in something like O(n log n); maybe this will help you (they're using 2D points): http://stackoverflow.com/questions/1602164/shortest-distance-between-points-algorithm. Otherwise, writing a C extension or using the `numba` package will significantly speed up your code. I might come back to this once I have some time. Make sure to use @filipkilibarda if you'd like me to be notified of your comments; otherwise I won't see them. – Filip Kilibarda Mar 20 '17 at 08:05
  • 5
    Here's a long but thorough discussion and proof for solving this problem in O(n log n): https://www.cs.ucsb.edu/~suri/cs235/ClosestPair.pdf – Filip Kilibarda Mar 21 '17 at 02:39
  • @FilipKilibarda But this is only for the closest pair. Even in 1D I do not quickly see how to efficiently compute the mean pairwise distance, i.e., the sum of the pairwise distances. – S. Huber Nov 02 '17 at 22:04
  • @FilipKilibarda Should have looked more thoroughly: in 1D one can consider the n-1 segments between the n points and count for each segment how often it contributes to a pairwise distance: the number of points to its left times the number of points to its right. So in 1D one can exploit the fact that the sum of pairwise distances splits up nicely into a sum of multiples of smaller distances. However, starting with 2D this does not seem to be so easy. – S. Huber Nov 02 '17 at 22:48
  • [[calculate MC2 distances, which is an O(nmin(k, n-k)) ]] I don't quite follow. You have MC2 computations for distances, each of which takes on the order of N. So it should be N x MC2. (Did I miss something?) – Axeon Thra Feb 13 '19 at 01:19

3 Answers

1

You didn't describe where your vectors come from, nor what use you will put the mean and minimum to. Here are some observations about the general case; limited ranges, error tolerance, and discrete values may admit of a more efficient approach.

The mean distance between M points sounds quadratic, O(M^2). But M / N is 10, fairly small, and N is huge, so the data probably resembles a hairy sphere in 1e3-space. Computing the centroid of the M points, and then computing the M distances to that centroid, might turn out to be useful in your problem domain; it's hard to tell.
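A minimal sketch of that centroid idea, assuming `X` is the M×N observation array (placeholder data below):

```python
import numpy as np

X = np.random.rand(10_000, 1_000)       # placeholder for the real observations

centroid = X.mean(axis=0)               # the N-dimensional centroid
dist_to_centroid = np.linalg.norm(X - centroid, axis=1)   # M distances, O(M*N) total
```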

The minimum distance among M points is more interesting. Choose a small number of pairs at random, say 100, compute their distances, and take half the minimum as an estimate of the global minimum distance. (Validate by comparing against the next few smallest distances, if desired.) Now use a spatial UB-tree to model each point as a positive integer. This involves finding the N per-dimension minima over the M × N values, adding constants so each minimum becomes zero, scaling so the estimated global minimum distance corresponds to at least 1.0, and then truncating to integers.
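A sketch of the sampling and integer-scaling steps just described; the variable names are illustrative, not from the original answer:

```python
import numpy as np

X = np.random.rand(10_000, 1_000)               # placeholder for the real observations
rng = np.random.default_rng(0)

# Estimate the global minimum distance from ~100 random pairs.
pairs = rng.choice(len(X), size=(100, 2), replace=False)   # 100 disjoint index pairs
sample_d = np.linalg.norm(X[pairs[:, 0]] - X[pairs[:, 1]], axis=1)
est_min = sample_d.min() / 2.0                  # half the sampled minimum

# Shift each dimension's minimum to zero, scale so est_min maps to 1.0,
# then truncate to non-negative integers.
X_int = np.floor((X - X.min(axis=0)) / est_min).astype(np.int64)
```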

With these transformed vectors in hand, we're ready to turn them into a UB-tree representation that we can sort, and then do nearest neighbor spatial queries on the sorted values. For each point, compute an integer key: shift the low-order bit of each dimension's value into the result, then iterate, continuing over all dimensions until all non-zero bits have been consumed and appear in the result; then proceed to the next point. (This is bit interleaving, as in a Morton / Z-order key.) Numerically sort the resulting integers, yielding a data structure similar to a PostGIS index.
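A minimal sketch of that bit-interleaving step; `n_bits`, the number of low-order bits kept per dimension, is an assumed parameter, not something specified in the answer:

```python
def morton_key(coords, n_bits=8):
    """Interleave the low-order bits of each integer coordinate into one integer key."""
    key = 0
    out_bit = 0
    for bit in range(n_bits):            # take one bit from every dimension...
        for value in coords:             # ...before moving on to the next bit
            key |= ((value >> bit) & 1) << out_bit
            out_bit += 1
    return key

# e.g. keys for a few small 2-D integer-scaled points, sorted numerically
keys = sorted(morton_key(p) for p in [(3, 1), (7, 2), (0, 5)])
```

With N ≈ 1,000 dimensions each key is N × n_bits bits long; Python's arbitrary-precision integers handle that, but, as noted below, this dimensionality is inconveniently large.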

Now you have a discretized representation that supports reasonably efficient queries for nearest neighbors (though admittedly N = 1e3 is inconveniently large). After finding two or more coarse-grained nearby neighbors, you can query the original vector representation to obtain high-resolution distances between them, for finer discrimination. If your data distribution turns out to have a large fraction of points that discretize to being off by a single bit from their nearest neighbor, e.g. locations of oxygen atoms where each has a buddy, then increase the global minimum distance estimate so the low-order bits offer adequate discrimination.

A similar discretization approach would be to appropriately scale, say, 2-dimensional inputs, mark an initially empty grid, and then scan immediate neighborhoods. This relies on the global minimum lying within a "small" neighborhood, thanks to the appropriate scaling. In your case you would be marking an N-dimensional grid.
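A sketch of the 2-D grid version, with an illustrative cell size chosen on the order of the estimated minimum distance so the closest pair lands in the same or adjacent cells:

```python
import numpy as np
from collections import defaultdict

pts = np.random.rand(1_000, 2)                 # placeholder 2-D points
cell = 0.05                                    # illustrative: roughly the estimated min distance

grid = defaultdict(list)
for idx, p in enumerate(pts):
    grid[tuple((p // cell).astype(int))].append(idx)

best = np.inf
for (cx, cy), members in grid.items():
    for dx in (-1, 0, 1):                      # scan the cell and its 8 neighbors
        for dy in (-1, 0, 1):
            for j in grid.get((cx + dx, cy + dy), []):
                for i in members:
                    if i < j:
                        best = min(best, np.linalg.norm(pts[i] - pts[j]))
print(best)                                    # the global min, provided the scaling was appropriate
```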

J_H
1

You may be able to speed things up with some sort of Space Partitioning.

For the minimum distance calculation, you would only need to consider pairs of points in the same or neighbouring partitions. For an approximate mean, you might be able to come up with some sort of weighted average based on the distances between partitions and the number of points within them.
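One concrete way to try this is a k-d tree via SciPy; this is an illustration of the idea rather than anything prescribed in the answer:

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(10_000, 1_000)      # placeholder for the real observations

tree = cKDTree(X)
# Distance from each point to its nearest other point (k=2: itself plus one neighbour).
d, _ = tree.query(X, k=2)
min_dist = d[:, 1].min()
```

Bear in mind that k-d trees degrade in very high dimensions, so with N ≈ 1,000 this may not beat brute force; it mainly illustrates the partitioning idea.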

Martin Stone
-2

I had the same issue before, and it worked for me once I normalized the values. So try to normalize the data before calculating the distance.
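The answer doesn't say which normalization was meant; one common reading (scaling each row to unit length, as discussed in the question's comments) would be:

```python
import numpy as np

X = np.random.rand(10_000, 1_000)                       # placeholder data
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)   # each row scaled to unit length
```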

  • @Srinivas746 the magnitude of the vector elements may have some bearing on the floating point stability, but the question was about time complexity. – awiebe Oct 25 '18 at 03:20