
I have a set of 3 million vectors (300 dimensions each), and I'm looking for a new point in this 300-dimensional space that is approximately equally distant from all the points (vectors) in the set.

What I could do is initialize a random vector v and run an optimization over v with the objective

$\sum_{i < j} \left(d_{vi} - d_{vj}\right)^2$

where $d_{xy}$ is the distance between vector x and vector y, but this would be very computationally expensive.

I'm looking for an approximate solution vector for this problem that can be found quickly over very large sets of vectors. (Or any library that will do something like this for me, in any language.)
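To make the setup concrete, here is a minimal sketch of this brute-force formulation (toy data and illustrative names, assuming numpy/scipy); it only stays tractable for tiny sets, which is exactly the problem:

import numpy as np
from scipy.optimize import minimize

def pairwise_objective(v, points):
    # distances from candidate point v to every point in the set: O(n*m)
    d = np.linalg.norm(points - v, axis=1)
    # squared difference of distances over every pair of points: O(n^2)
    diff = d[:, None] - d[None, :]
    return 0.5 * np.sum(diff ** 2)  # 0.5 counts each unordered pair once

# toy stand-in data; the real problem is 3 million points x 300 dimensions
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 300))
v0 = rng.normal(size=300)
result = minimize(pairwise_objective, v0, args=(points,), method="L-BFGS-B")
print(result.fun)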

user8472
  • Have you tried something? – farhawa Jun 11 '15 at 10:09
  • @farhawa I tried running a Python script that used scipy.optimize.minimize() to minimize the objective function I described above. Of course, it involved 3M distance calculations per iteration, and then an O(n^2) pass over the vector set, so it only worked in reasonable time on tiny sets of vectors (around 10,000). – user8472 Jun 11 '15 at 10:14
  • could you put an example of your vectors? – farhawa Jun 11 '15 at 10:16
  • Did I understand correctly: you want to find the vector v with the smallest sum over all distances |v - i| to all other vectors in your set? Maybe some stochastic method would be suitable, where you check a random sample of vectors i for every v and choose the one with the smallest sum over all distances with regard to that random set. Maybe a Monte Carlo simulation can help; just my two cents (this is not a highly qualified comment, but maybe a hint): http://de.wikipedia.org/wiki/Monte-Carlo-Simulation. – Asking Questions Jun 11 '15 at 10:19
  • @Asking Questions: Sorry if I wasn't too clear -- I'm looking for a vector (not part of my original set of 3M) which is equally far from all the other vectors. I'm not trying to reduce the total sum of distances of my vector from the rest of the set. – user8472 Jun 11 '15 at 10:25
  • Can this be rephrased as: find the hypersphere which is closest to your set of points? Your objective function (missing a summation symbol?) has a trivial solution: choose two distinct points p and q, and their midpoint m: $(d_{mp} - d_{mq})^2 = 0$. Sorry if I missed something. – Stefano M Jun 11 '15 at 14:31
  • @Stefano M, yes I missed a summation. Sorry :( – user8472 Jun 12 '15 at 19:13

2 Answers


From this question on the Math StackExchange:

There is no point that is equidistant from 4 or more points in general position in the plane, or n+2 points in n dimensions.

Criteria for representing a collection of points by one point are considered in statistics, machine learning, and computer science. The centroid is the optimal choice in the least-squares sense, but there are many other possibilities.

The centroid is the point C in the plane for which the sum of squared distances $\sum |CP_i|^2$ is minimum. One could also optimize a different measure of centrality, or insist that the representative be one of the points (such as a graph-theoretic center of a weighted spanning tree), or assign weights to the points in some fashion and take the centroid of those.

Note, specifically, "the centroid is the optimal choice in the least-squares sense", so the optimal solution to your cost function (which is a least-squares cost) is simply to average all the coordinates of your points (which will give you the centroid).
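As a minimal sketch of that computation (assuming the vectors are stacked in a numpy array; the names are illustrative only):

import numpy as np

# points: (n, m) array holding the vectors, e.g. 3M x 300 in your case
points = np.random.rand(10_000, 300)  # stand-in data for illustration

# the centroid is just the per-dimension mean: a single O(n*m) pass
centroid = points.mean(axis=0)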

EelkeSpaak
  • I'm not convinced by the (accepted) answer on Math StackExchange. It is easy to construct examples in which a set of points lies on a hypersphere but their centroid is far from the centre of the hypersphere itself. – Stefano M Jun 11 '15 at 14:17
  • @StefanoM: yes, but I don't think that answer (or mine) states that? If you construct a set of points lying all near one 'pole' of the hypersphere, then obviously the centroid of the set won't be the centre of the hypersphere. I can't think of any sets of points distributed *uniformly* along a hypersphere where their centroid would not be the hypersphere's centre. – EelkeSpaak Jun 11 '15 at 14:26
  • Agreed, but no assumption about the spatial distribution of the given points was made anywhere... The point here is to know whether, for the OP's dataset, the centroid is a good or bad choice. The general statement "the centroid is the optimal choice in the least-squares sense" could be misleading. – Stefano M Jun 11 '15 at 14:38
  • The text you quote is saying that the centroid minimizes the sum of the squared distances to all the points. The OP is asking to minimize the sum over every pair of points (i, j) of the squared difference in distances from the new point to i and j, which is different. As a simple example consider points {0, 3, 3} in 1-dimensional space, with centroid 2. This has value 2 for the OP's objective, which is not optimal (by inspection the optimal solution is 1.5, with objective value 0 since it's equidistant from all points). – josliber Jun 11 '15 at 15:11
  • @josilber Maybe so, but OP's formula doesn't seem to square with his stated objective. – Robert Dodier Jun 11 '15 at 20:09
  • @RobertDodier not sure I follow -- OP says they want to find a point v that is equidistant as possible from the specified points. The objective sums over all pairs of points (i, j), assessing a higher penalty when the d_{vi} and d_{vj} are different (aka i and j are not equidistant from v). This seems to match the stated objective pretty well, imo. – josliber Jun 11 '15 at 20:12

I agree that in general this is a pretty tough optimization problem, especially at the scale you're describing. Each objective function evaluation requires O(nm + n^2) work for n points of dimension m -- O(nm) to compute distances from each point to the new point and O(n^2) to compute the objective given the distances. This is pretty scary when m=300 and n=3M: the n^2 term alone is roughly 9e12 pairs. Thus even one function evaluation is probably intractable, not to mention solving the full optimization problem.

One approach that has been mentioned in the other answer is to take the centroid of the points, which can be computed efficiently -- O(nm). A downside of this approach is that it could do terribly at the proposed objective. For instance, consider a situation in 1-dimensional space with 3 million points with value 1 and 1 point with value 0. By inspection, the optimal solution is v=0.5 with objective value 0 (it's equidistant from every point), but the centroid will select v=1 (well, a tiny bit smaller than that) with objective value 3 million.

An approach that I think will do better than the centroid is to optimize each dimension separately (ignoring the existence of the other dimensions). While the objective function is still expensive to compute in this case, a bit of algebra shows that the derivative of the objective is quite easy to compute. It is the sum over all pairs (i, j) with i < v and j > v of the value 4*((v-i)+(v-j)); pairs lying on the same side of v contribute a constant to the objective, so they drop out of the derivative. Remember we're optimizing a single dimension, so the points i and j are 1-dimensional, as is v. For each dimension we can therefore sort the data (O(n lg n)) and then compute the derivative for a value v in O(n) time using a binary search and basic algebra. We can then use scipy.optimize.newton to find the zero of the derivative, which will be the optimal value for that dimension. Iterating over all dimensions, we'll have an approximate solution to our problem.

First consider the proposed approach versus the centroid method in a simple setting, with 1-dimensional data points {0, 3, 3}:

import bisect
import scipy.optimize

# full objective: sum over all pairs of points of the squared difference
# in their distances to the candidate point x
def fulldist(x, data):
    dists = [sum([(x[i]-d[i])*(x[i]-d[i]) for i in range(len(x))])**0.5 for d in data]
    obj = 0.0
    for i in range(len(data)-1):
        for j in range(i+1, len(data)):
            obj += (dists[i]-dists[j]) * (dists[i]-dists[j])
    return obj

# derivative of the 1-dimensional objective at x, given sorted data d
def f1p(x, d):
    lownum = bisect.bisect_left(d, x)   # number of points below x
    highnum = len(d) - lownum           # number of points at or above x
    lowsum = highnum * (x*lownum - sum([d[i] for i in range(lownum)]))
    highsum = lownum * (x*highnum - sum([d[i] for i in range(lownum, len(d))]))
    return 4.0 * (lowsum + highsum)

data = [(0.0,), (3.0,), (3.0,)]
opt = []
centroid = []
for d in range(len(data[0])):
    thisdim = [x[d] for x in data]
    meanval = sum(thisdim) / len(thisdim)
    centroid.append(meanval)
    thisdim.sort()
    # find the zero of the derivative in this dimension, starting from the mean
    opt.append(scipy.optimize.newton(f1p, meanval, args=(thisdim,)))
print("Proposed", opt, "objective", fulldist(opt, data))
# Proposed [1.5] objective 0.0
print("Centroid", centroid, "objective", fulldist(centroid, data))
# Centroid [2.0] objective 2.0

The proposed approach finds the exact optimal solution, while the centroid method misses by a bit.

Consider a slightly larger example with 1000 points of dimension 300, with each point drawn from a gaussian mixture. Each point's value is normally distributed with mean 0 and variance 1 with probability 0.1 and normally distributed with mean 100 and variance 1 with probability 0.9:

import random

# 1000 points of dimension 300, each coordinate drawn from a two-component
# Gaussian mixture (mean 0 with probability 0.1, mean 100 with probability 0.9)
data = []
for n in range(1000):
    d = []
    for m in range(300):
        if random.random() <= 0.1:
            d.append(random.normalvariate(0.0, 1.0))
        else:
            d.append(random.normalvariate(100.0, 1.0))
    data.append(d)

The resulting objective values were 1.1e6 for the proposed approach and 1.6e9 for the centroid approach, meaning the proposed approach decreased the objective by more than 99.9%. Obviously the differences in the objective value are heavily affected by the distribution of the points.

Finally, to test the scaling (removing the final objective value calculations, since they're in general intractable), I get the following scaling with m=300: 0.9 seconds for 1,000 points, 7.1 seconds for 10,000 points, and 122.3 seconds for 100,000 points. Therefore I expect this should take about 1-2 hours for your full dataset with 3 million points.
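If the pure-Python sums inside f1p become the bottleneck at that scale, one possible refinement (just a sketch assuming numpy, not something benchmarked on the full dataset) is to precompute per-dimension prefix sums, so each derivative evaluation costs O(log n) after the sort instead of O(n):

import numpy as np
import scipy.optimize

def f1p_fast(x, col, prefix):
    # same derivative as f1p, computed from a sorted column and its prefix sums
    lownum = np.searchsorted(col, x, side='left')  # points strictly below x
    highnum = len(col) - lownum
    below = prefix[lownum - 1] if lownum > 0 else 0.0
    above = prefix[-1] - below
    lowsum = highnum * (x * lownum - below)
    highsum = lownum * (x * highnum - above)
    return 4.0 * (lowsum + highsum)

def approx_equidistant(points):
    # points: (n, m) numpy array; returns one optimized coordinate per dimension
    opt = np.empty(points.shape[1])
    for dim in range(points.shape[1]):
        col = np.sort(points[:, dim])
        prefix = np.cumsum(col)
        opt[dim] = scipy.optimize.newton(f1p_fast, col.mean(), args=(col, prefix))
    return opt

The per-dimension results are the same as with f1p; only the cost per derivative evaluation changes.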

josliber
  • Thank you for the suggestion. Greedy solutions are always a great place to start (this should have hit me as a possible approximation-- some computer science student I am!) Also, that's a very elaborate answer, and I really appreciate the effort you took to write a script and estimate how much time it would take for my dataset. Thank you once again. – user8472 Jun 12 '15 at 19:20