
Imagine you are given a set S of n points in 3 dimensions. The distance between any 2 points is the ordinary Euclidean distance. You want to choose a subset Q of k points from this set such that they are farthest from each other. In other words, there is no other subset Q' of k points such that the minimum of all pairwise distances in Q is less than that in Q'.

If n is approximately 16 million and k is about 300, how do we efficiently do this?

My guess is that this is NP-hard, so maybe we just want to focus on approximation. One idea I can think of is using multidimensional scaling to sort these points along a line and then use a version of binary search to get the points that are farthest apart on this line.
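
A rough sketch of that idea, using the first principal component as a cheap stand-in for full multidimensional scaling and then binary searching on the smallest allowed gap along that line. The function and its name are made up for illustration, and the 1-D projection can of course be far from optimal for a genuinely 3-D cloud:

import numpy as np

def project_and_spread(S, k):
  """Project S onto its first principal axis, then binary search on the largest
  gap g such that k points can be kept whose 1-D spacing is at least g."""
  centred = S - S.mean(axis=0)
  direction = np.linalg.svd(centred, full_matrices=False)[2][0]  # first principal axis
  order = np.argsort(centred @ direction)
  line = (centred @ direction)[order]            # sorted 1-D coordinates

  def keep(gap):
    # Greedy sweep: keep a point whenever it is at least `gap` past the last kept one
    kept, last = [order[0]], line[0]
    for pos, idx in zip(line[1:], order[1:]):
      if pos - last >= gap:
        kept.append(idx)
        last = pos
    return kept

  lo, hi = 0.0, line[-1] - line[0]
  for _ in range(50):                             # binary search on the gap
    mid = (lo + hi) / 2
    if len(keep(mid)) >= k:
      lo = mid
    else:
      hi = mid
  return S[keep(lo)[:k]]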

Shital Shah
  • k is about 300. – Shital Shah Feb 22 '18 at 11:01
  • Any information on the distribution of points? If a plot of the points looked like a circular or oval cloud then you might fit an oval to the cloud, work out k' equidistant points on the circumference, and then find the k points closest to those k'. There is a data structure that speeds up finding points near another point - can't remember the name at the moment. That oval's size might need iterating over for a better fit too (a rough sketch of this idea follows these comments). – Paddy3118 Feb 22 '18 at 12:24
  • Indeed, in two dimensions there's an O(N) solution for points located on a circle. Extension to a sphere or some other approximate convex surface may be possible. – Richard Mar 31 '20 at 15:35
  • More generally, I'd worry a lot about the distribution, because with N >> k it will be easy for many distributions to rule out many points as being part of the set. – Richard Mar 31 '20 at 15:38
  • If you had a representative smaller sample of points, you and others could investigate algorithms... – Paddy3118 Apr 01 '20 at 03:14
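
For what it's worth, here is a minimal sketch of the oval/circle idea from the comment above for a roughly two-dimensional cloud. A k-d tree is one data structure that speeds up "point near a point" queries; the crude circle fit and the function name are just placeholders:

import numpy as np
from scipy.spatial import cKDTree

def circle_heuristic(points, k):
  """Fit a crude circle to a 2-D cloud, place k equally spaced targets on its
  circumference, and snap each target to the nearest actual data point."""
  centre = points.mean(axis=0)
  radius = np.linalg.norm(points - centre, axis=1).mean()
  angles = np.linspace(0, 2 * np.pi, k, endpoint=False)
  targets = centre + radius * np.column_stack([np.cos(angles), np.sin(angles)])
  tree = cKDTree(points)               # fast nearest-neighbour lookups
  _, nearest = tree.query(targets)
  return points[np.unique(nearest)]    # may be fewer than k if targets collide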

4 Answers


This is known as the discrete p-dispersion (maxmin) problem.

The optimality bound is proved in White (1991), and Ravi et al. (1994) give a factor-2 approximation for the problem, with the latter proving that this heuristic is the best possible (unless P=NP).

Factor-2 Approximation

The factor-2 approximation is as follows:

Let V be the set of nodes/objects.
Let i and j be two nodes at maximum distance.
Let p be the number of objects to choose.
P = set([i,j])
while size(P)<p:
  Find a node v in V-P such that min_{v' in P} dist(v,v') is maximum.
  That is: find the node with the greatest minimum distance to the set P.
  P = P.union(v)
Output P

You could implement this in Python like so:

#!/usr/bin/env python3

import numpy as np

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N,N) #Random matrix
d = (d + d.T)/2         #Make the matrix symmetric

print("Finding initial edge...")
maxdist  = 0
bestpair = ()
for i in range(N):
  for j in range(i+1,N):
    if d[i,j]>maxdist:
      maxdist = d[i,j]
      bestpair = (i,j)

P = set()
P.add(bestpair[0])
P.add(bestpair[1])

print("Finding optimal set...")
while len(P)<p:
  print("P size = {0}".format(len(P)))
  maxdist = 0
  vbest = None
  for v in range(N):
    if v in P:
      continue
    for vprime in P:
      if d[v,vprime]>maxdist:
        maxdist = d[v,vprime]
        vbest   = v
  P.add(vbest)

print(P)

Exact Solution

You could also model this as an MIP. For p=50, n=400, the optimality gap was still 568% after 6000 s, whereas the approximation algorithm above took 0.47 s to obtain an optimality gap of 100% (or less). A naive Gurobi Python representation might look like this:

#!/usr/bin/env python
import numpy as np
import gurobipy as grb

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N,N) #Random matrix
d = (d + d.T)/2             #Make the matrix symmetric

m = grb.Model(name="MIP Model")

used  = [m.addVar(vtype=grb.GRB.BINARY) for i in range(N)]

objective = grb.quicksum( d[i,j]*used[i]*used[j] for i in range(0,N) for j in range(i+1,N) )

m.addConstr(
  lhs=grb.quicksum(used),
  sense=grb.GRB.EQUAL,
  rhs=p
)

# for maximization
m.ModelSense = grb.GRB.MAXIMIZE
m.setObjective(objective)

# m.Params.TimeLimit = 3*60

# solve with Gurobi
ret = m.optimize()
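
Note that the quadratic objective above maximizes the sum of pairwise distances among the chosen points rather than the minimum. If you want the max-min objective itself in the model, a common linearization uses one continuous variable and big-M constraints. A sketch with the same Gurobi API (not benchmarked, and the model name is arbitrary):

import numpy as np
import gurobipy as grb

p = 50
N = 400

d = np.random.rand(N,N)   # Random symmetric "distance" matrix, as above
d = (d + d.T)/2

m = grb.Model(name="maxmin MIP")
x = m.addVars(N, vtype=grb.GRB.BINARY)   # x[i] = 1 if point i is chosen
M = float(d.max())
D = m.addVar(lb=0, ub=M)                 # minimum pairwise distance among chosen points

m.addConstr(x.sum() == p)
for i in range(N):
  for j in range(i+1, N):
    # If both i and j are chosen, D may not exceed d[i,j]; otherwise the constraint is slack
    m.addConstr(D <= d[i,j] + M*(2 - x[i] - x[j]))

m.setObjective(D, grb.GRB.MAXIMIZE)
m.optimize()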

Scaling

Obviously, the O(N^2) scaling for finding the initial pair is bad. We can find it more efficiently by recognizing that the pair must lie on the convex hull of the dataset, which can be computed in O(N log N). Once we've found the pair, we proceed as before (using SciPy for acceleration).

The best way of scaling further would be to use an R*-tree to efficiently find the minimum distance between a candidate point and the set P. But this cannot be done efficiently in pure Python, since an explicit for loop is still involved.

import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist

p = 300
N = 16000000

# Find a convex hull in O(N log N)
points = np.random.rand(N, 3)   # N random points in 3-D

# Returned 420 points in testing
hull = ConvexHull(points)

# Extract the points forming the hull
hullpoints = points[hull.vertices,:]

# Naive way of finding the best pair in O(H^2) time if H is number of points on
# hull
hdist = cdist(hullpoints, hullpoints, metric='euclidean')

# Get the farthest apart points
bestpair = np.unravel_index(hdist.argmax(), hdist.shape)

P = np.array([hullpoints[bestpair[0]],hullpoints[bestpair[1]]])

# Now greedily grow P: repeatedly add the point whose minimum distance to P is largest
print("Finding optimal set...")
while len(P)<p:
  print("P size = {0}".format(len(P)))
  distance_to_P        = cdist(points, P)
  minimum_to_each_of_P = np.min(distance_to_P, axis=1)
  best_new_point_idx   = np.argmax(minimum_to_each_of_P)
  best_new_point = np.expand_dims(points[best_new_point_idx,:],0)
  P = np.append(P,best_new_point,axis=0)

print(P)
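
One further speedup not shown above: instead of recomputing cdist(points, P) against all of P on every iteration, you can keep a running array of each point's minimum distance to P and update it only against the newest member, giving O(k*N) distance computations overall. A sketch (the function name is made up; seed it with either point of the farthest hull pair found above):

import numpy as np
from scipy.spatial.distance import cdist

def greedy_maxmin(points, seed_idx, k):
  """Greedy max-min selection with incremental distance updates."""
  chosen = [seed_idx]
  # Minimum distance from every point to the current selection
  min_dist = cdist(points, points[[seed_idx]]).ravel()
  while len(chosen) < k:
    new_idx = int(np.argmax(min_dist))
    chosen.append(new_idx)
    # Only distances to the newly added point can lower the running minimum
    min_dist = np.minimum(min_dist, cdist(points, points[[new_idx]]).ravel())
  return points[chosen]
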
Richard
  • `if d[v,vprime]>maxdist` is actually finding the maximum distance between the v and the P. Shouldn't it be minimum? – Md Monjur Ul Hasan Mar 23 '23 at 04:47
  • @MdMonjurUlHasan: We're endeavouring to maximize the minimum distance. – Richard Mar 23 '23 at 15:17
  • Thanks for this post! Can you comment on why Gurobi didn't find the optimal solution? Is it infeasible for [N=400, p = 50] or larger problems? – thc Jul 17 '23 at 17:41

I am also pretty sure that the problem is NP-hard; the most similar problem I found is the k-center problem. If runtime is more important than correctness, a greedy algorithm is probably your best choice:

Q ={}
while |Q| < k
    Q += p from S where mindist(p, Q) is maximal

Side note: in similar problems, e.g. the set-cover problem, it can be shown that the solution of the greedy algorithm is at least 63% as good as the optimal solution.

In order to speed things up I see 3 possibilities:

  1. Index your dataset in an R-tree first, then perform a greedy search. Construction of the R-tree is O(n log n), and although it was developed for nearest-neighbor search, it can also help you find the point furthest from a set of points in O(log n). This might be faster than the naive O(k*n) algorithm.

  2. Sample a subset of your 16 million points and perform the greedy algorithm on the subset. You are approximating anyway, so you might be able to spare a little more accuracy. You can also combine this with approach 1.

  3. Use an iterative approach and stop when you are out of time. The idea here is to randomly select k points from S (let's call this set Q'). Then in each step you swap the point p_ in Q' that has the minimum distance to another point in Q' for a random point from S. If the resulting set Q'' is better, proceed with Q''; otherwise repeat with Q'. In order not to get stuck, you might want to choose a point from Q' other than p_ if you cannot find an adequate replacement for a couple of iterations. (A rough sketch of this follows.)
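
A minimal sketch of that random-swap idea; the function name and iteration budget are made up, and the "pick a different point when stuck" refinement is omitted:

import numpy as np
from scipy.spatial.distance import pdist, cdist

def random_swap(S, k, iters=10000, seed=None):
  """Start from a random Q' and repeatedly try to swap out the point involved
  in the smallest pairwise distance for a random point from S."""
  rng = np.random.default_rng(seed)
  Q = rng.choice(len(S), size=k, replace=False)
  best = pdist(S[Q]).min()
  for _ in range(iters):
    D = cdist(S[Q], S[Q])
    np.fill_diagonal(D, np.inf)
    worst = np.unravel_index(D.argmin(), D.shape)[0]   # member with the closest neighbour
    cand = Q.copy()
    cand[worst] = rng.integers(len(S))
    score = pdist(S[cand]).min() if len(set(cand)) == k else -np.inf
    if score > best:
      Q, best = cand, score
  return S[Q], best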

SaiBot

If you can afford to do ~k*n distance calculations then you could

  1. Find the center of the distribution of points.
  2. Select the point furthest from the center (and remove it from the set of unselected points).
  3. Find the point furthest from all the currently selected points and select it.
  4. Repeat step 3 until you end up with k points.
Paddy3118
  • Step 3 is the crux of the problem. This naive algorithm won’t scale for large n. – Shital Shah Feb 22 '18 at 11:03
  • I don't think your greedy algorithm works here. Let's say `k=3` and `S` is 6 points in the same plane. They all belong to a circle's circumference and are evenly distributed there (the distance between 2 "neighbours" is `R`). After you pick your first point, your algorithm takes the furthest one, which is the one at the other end of the diameter. As a last point you must pick "a neighbour" (so your min dist will be `R`), while the optimal solution would choose points that form an equilateral triangle (so the min would be `R*sqrt(3)`). – kszl Feb 22 '18 at 14:11
  • Hi @BUZZY, that's the nature of greedy algorithms: you want one that gives a good enough result, but it is not guaranteed to be the best result found by checking all possibilities. – Paddy3118 Feb 22 '18 at 14:42
  • I understand that. My problem is that your answer doesn't make it clear it's a best-effort algorithm, not the actual solution. Also, if it's just an approximation, it would be worth adding a note on how far it might be from the optimal solution. Otherwise, it's just a set of points picked at random, at least for me. – kszl Feb 22 '18 at 14:46
  • OK @BUZZY. When reading the question, I had in mind "My guess is that this is NP-hard, so maybe we just want to focus on approximation" and assumed that all answers would be taken as non-optimal. – Paddy3118 Feb 22 '18 at 17:09

Find the maximum extent of all points. Split it into 7x7x7 voxels. For all points in a voxel, find the point closest to the voxel's centre. Return these (up to) 7x7x7 = 343 points. Some voxels may contain no points; hopefully not too many.
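
A minimal sketch of this (the helper name is made up, and the plain Python loop over 16 million points would want vectorising in practice):

import numpy as np

def voxel_representatives(points, cells=7):
  """For each occupied cell of a cells x cells x cells grid over the bounding
  box, return the point closest to that cell's centre."""
  lo, hi = points.min(axis=0), points.max(axis=0)
  # Which voxel each point falls in (clip so the maximum lands in the last cell)
  idx = np.clip(((points - lo) / (hi - lo) * cells).astype(int), 0, cells - 1)
  keys = idx[:, 0]*cells*cells + idx[:, 1]*cells + idx[:, 2]
  centres = lo + (idx + 0.5) / cells * (hi - lo)
  dist = np.linalg.norm(points - centres, axis=1)
  best = {}                            # voxel key -> index of point nearest its centre
  for i, key in enumerate(keys):
    if key not in best or dist[i] < dist[best[key]]:
      best[key] = i
  return points[list(best.values())]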

Paddy3118