
I have a set S of n points in dimension d for which I can calculate all pairwise distances if need be. I need to select k points in this set so that the sum of their pairwise distances is maximal. In other, slightly more mathematical words, I want p1, ..., pk in S such that sum(i,j < k) dist(pi, pj) is maximal.
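Writing the objective out explicitly (the sum runs over unordered pairs), what I want is

\max_{\{p_1, \dots, p_k\} \subseteq S} \; \sum_{1 \le i < j \le k} \operatorname{dist}(p_i, p_j)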

I know this question is related to this one (which is basically the same as mine but for k=2) and maybe to this one (with 'farthest' instead of 'closest').

I am not too sure about this, but maybe all the possible solutions have all their points on the convex hull?

Any reasonable approximation/heuristic is fine.

Virtual bonus point #1 for a solution which works for any function that gives a score out of the four points (one of which could be the square root of the sum of the squared distances).

Virtual bonus point #2 if the solution is easily implemented in python + numpy/scipy.

Nathan
  • Re: All solution points on convex hull: This is not true, unfortunately, as the convex hull might have fewer than `k` vertices. – us2012 Oct 01 '13 at 15:50
  • Then, is it true if _k_ < _h_, with _h_ the number of points in the hull? – Nathan Oct 01 '13 at 16:05
  • Often "farthest"/"longest" is more difficult than "closest" with graphs. Do *not* think that if you can find the closest points/minimum of something then you can also find the farthest points/maximum of something with mostly the same code and the same running time. – Bakuriu Oct 01 '13 at 17:01
  • I had asked similar question which got some great answers: https://stackoverflow.com/questions/48925086/choosing-subset-of-farthest-points-in-given-set-of-points. The difference in my question is that I wanted to optimize min of distances as opposed to sum of distances. – Shital Shah Oct 20 '21 at 19:22

4 Answers


Your problem seemed similar to the weighted minimum vertex cover problem (which is NP-complete). Thanks to @Gareth Rees for the comments below clarifying that I was wrong about how a vertex cover relates to the set you're looking for. The vertex cover problem and its literature may still be worth investigating, though, since your problem might be discussed alongside it and the two do share some features.

If you're willing to work with diameter instead of summed graph weight, you could use the approach for the minimal diameter set that you linked in your question. If your current distance measure is called d (the one for which you want the points furthest from each other) then just define d' = 1/d and solve the minimum distance problem with d'.
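For instance (a minimal numpy sketch; the distance matrix here is a made-up example, and the `+ 1` is just one way to avoid dividing by zero, as discussed in the comments below):

import numpy as np

# Hypothetical pairwise distance matrix (symmetric, zero diagonal).
rng = np.random.default_rng(0)
pts = rng.random((50, 3))
all_dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

# Large distances under d map to small values of d_prime, so a
# minimum-distance method run on d_prime prefers far-apart points.
d_prime = 1.0 / (all_dists + 1.0)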

There might also be a relationship between some form of graph cutting algorithm, like say normalized cut, and the subset you seek. If your distance measure is used as the graph weight or affinity between nodes, you might be able to modify an existing graph cutting objective function to match your objective function (looking for the group of k nodes that have maximum summed weight).

This seems like a combinatorially difficult problem. You might consider something simple like simulated annealing. The proposal function could just choose a point that's currently in the k-subset at random and swap it with a randomly chosen point not currently in the k-subset.

You would need a good cooling schedule for the temperature term and may need to use reheating as a function of cost. But this sort of thing is really simple to program. As long as n is reasonably small, you can just repeatedly select random k-subsets and anneal towards a k-subset with a very large total distance.

This would only give you an approximation, but any tractable deterministic method is also likely to give only an approximation.

Below is a first hack at what the simulated annealing code might be. Note that I'm not making guarantees about this. It could be an inefficient solution if calculating distance is too hard or the problem instance size grows too large. I'm using very naive geometric cooling with a fixed cooling rate, and you may also want to tinker with a fancier proposal than just randomly swapping around nodes.

import numpy as np

all_nodes = np.asarray(...) # Set of nodes
all_dists = np.asarray(...) # Pairwise distance matrix: all_dists[i, j] = dist between nodes i and j

N = len(all_nodes)
k = 10 # Or however many you want.

def calculate_distance(node_subset, distances):
    # Sum of pairwise distances among a particular subset of node indices.
    # The submatrix is symmetric with a zero diagonal, hence the division by 2.
    return distances[np.ix_(node_subset, node_subset)].sum() / 2.0

# Initial random split into a k-subset and the rest. Working with indices
# keeps both subsets consistent with the rows/columns of all_dists.
indices = np.random.permutation(N)
current_subset = indices[:k]
current_outsiders = indices[k:]

# Simulated annealing parameters.
temp = 100.0
cooling_rate = 0.95
num_iters = 10000

# Simulated annealing loop.
for ii in range(num_iters):
    proposed_subset = current_subset.copy()
    proposed_outsiders = current_outsiders.copy()

    # Propose swapping one random member of the subset with one random outsider.
    index_to_swap = np.random.randint(k)
    outsider_to_swap = np.random.randint(N - k)

    tmp = current_subset[index_to_swap]
    proposed_subset[index_to_swap] = current_outsiders[outsider_to_swap]
    proposed_outsiders[outsider_to_swap] = tmp

    current_score = calculate_distance(current_subset, all_dists)
    proposed_score = calculate_distance(proposed_subset, all_dists)

    if proposed_score > current_score:
        accept = True
    else:
        # Metropolis step: accept a worse subset with probability
        # exp(delta / temp), which shrinks toward zero as the temperature drops.
        accept = np.exp((proposed_score - current_score) / temp) >= np.random.rand()

    if accept:
        current_subset = proposed_subset
        current_outsiders = proposed_outsiders

    temp = cooling_rate * temp
ely
  • Thank you for your advice. I was thinking of a monte-carlo algorithm but I'll try this simulated annealing method for sure. I'll wait a few hours/days to see if any more "specialized" solution comes up, but otherwise I think I'll select this one. – Nathan Oct 02 '13 at 08:50
  • As for the suggestion taking `d' = 1/d`, I am not too sure about that since `1/d` doesn't seem like a true "metric" to me, and I don't know what kind of impact this has on the proposed algorithm. – Nathan Oct 02 '13 at 08:53
  • I don't follow your argument that the OP's problem is NP-complete, nor do I see the connection with MINIMUM VERTEX COVER. (In the MINIMUM VERTEX COVER problem you are trying to pick the *smallest* covering set, but in the OP's problem you must pick *exactly* k points.) Can you expand on this part of your answer? – Gareth Rees Oct 02 '13 at 13:09
  • I don't know that there is a perfect connection, which is why I said, "I am not entirely sure." I'm also talking about the weighted minimum vertex cover. And because the graph necessarily has an edge between any two points (since it is just a distance from a metric space), I believe this avoids the problem you mention. That is, any k-set of points will be a cover, even a single point could be a cover, so then the weight minimization is all that's left. I do not know if that simplification makes it solvable in poly-time, it easily could like the bipartite version for maximal independent set. – ely Oct 02 '13 at 13:16
  • If a graph has an edge between every two vertices, then a vertex cover must contain either all vertices, or all but one. (If a set S is missing two vertices, then those two vertices have an edge between them which is not covered, hence S is not a vertex cover.) So I don't understand your claim "because the graph necessarily has an edge between any two points ... any k-set of points will be a cover". – Gareth Rees Oct 02 '13 at 13:24
  • You are correct, I was mistaken about that. Is it right then that any vertex cover of a complete graph will consist of any N-1 nodes out of the N nodes in the graph? – ely Oct 02 '13 at 13:49
  • @Nathan, you can also take `d' = 1/(d+1)` to avoid any division by zero. If you're looking to solve problem treating the distance as a graph weight, then it won't matter if it is a metric in the metric space. The particular functional form you choose for inverting the distance could matter, depending on what graph algorithm you use, but my prior belief is that as long as your `d'` reverses the ordering of points (ordered by distance away), then result should be robust to the choice of `d'`. – ely Oct 02 '13 at 14:07
  • @EMS I see your point on `d'`. @GarethRees, do you have any intuition yourself about whether this problem is NP-complete or not, or any advice on how to tackle it? – Nathan Oct 03 '13 at 15:00

How about this greedy algorithm:

  1. add to the solution the 2 points with the greatest distance between them in S
  2. until you reach a solution of size k, add to the solution the point for which the sum of distances from it to all the points already in the solution is the greatest.

Let's call the greatest distance between any 2 points D. For each point we add to the solution, we add at least D to the total, because by the triangle inequality its distances to the first two chosen points sum to at least D. So the greedy solution is at least (k-1)*D, while any solution consists of k(k-1)/2 pairwise distances, none of which exceeds D. So in the worst case you get a solution within a factor k/2 of the optimum.

I'm not sure this is the tightest bound that can be proved for this heuristic.
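A sketch of this greedy procedure with numpy (assuming the full pairwise distance matrix `all_dists` from the question's setup is precomputed; it returns indices into that matrix):

import numpy as np

def greedy_max_sum_subset(all_dists, k):
    # all_dists: (n, n) symmetric matrix of pairwise distances.
    # Returns k indices chosen greedily to (approximately) maximize
    # the sum of pairwise distances.
    n = all_dists.shape[0]

    # Step 1: start with the two points realizing the greatest distance.
    i, j = np.unravel_index(np.argmax(all_dists), all_dists.shape)
    chosen = [int(i), int(j)]
    remaining = set(range(n)) - set(chosen)

    # Step 2: repeatedly add the point whose summed distance to the
    # already-chosen points is largest.
    while len(chosen) < k:
        rest = np.fromiter(remaining, dtype=int)
        sums = all_dists[np.ix_(rest, chosen)].sum(axis=1)
        best = int(rest[np.argmax(sums)])
        chosen.append(best)
        remaining.remove(best)

    return chosen

Each iteration only looks at distances between the remaining points and the chosen ones, so for k much smaller than n the loop is cheap compared with computing `all_dists` itself.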

Ron Teller
  • This is a great suggestion. – ely Oct 01 '13 at 17:03
  • While this is a great starting point for further optimization, I think this greedy algorithm has some problems. Take a set of 2D points that roughly forms a disk, and k=3. This algorithm takes 2 points on the outer circle that belong to the same diameter, then adds a point at 90 degrees on the circle, which leads to a right triangle that is quite far from the equilateral triangle that is the optimum shape. – Nathan Oct 02 '13 at 08:47
  • @Nathan, the question is what you mean by "quite far from the optimum shape." Do you have a specific application in mind where the tolerance for non-optimality is quantified? As I would interpret it, the greedy algorithm locating the right triangle is a pretty great approximation of the equilateral triangle, given how simple of an approach this is. Making an unqualified statement such as "the right triangle is not a good approximation of the equilateral triangle" is unhelpful. What, specifically, do you mean by "good approximation"? – ely Oct 02 '13 at 14:54
  • Sorry, I reckon this was a rather subjective statement. I had in mind an error in the distance sum of ~0.1-1% from the optimal solution. – Nathan Oct 03 '13 at 14:57
  • After tinkering with the simulated annealing scheme, I must admit this greedy algorithm produces very good results that are hard to refine. Nice! – Nathan Oct 04 '13 at 14:55

Step 1: Precompute dist(pi, pj) for all pairs pi,pj

Step 2: Construct a complete graph V={p1,...,pn} with edge weights w_ij = dist(pi, pj)

Step 3: Solve the maximum edge-weighted clique (MEC) problem.

MEC is definitely NP-complete, and it is strongly related to quadratic programming. Thus, you might focus on heuristics if n is large (or even of moderate size).

Note that, by precomputing the edge weights, there are no restrictions on the distance function.
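For steps 1 and 2, a short scipy sketch (the `points` array here is a made-up example; step 3, actually solving MEC, is the hard part and is not shown):

import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.random.rand(200, 3)  # made-up (n, d) input for illustration

# Step 1: all pairwise distances (pdist supports many metrics, e.g. 'euclidean').
# Step 2: squareform turns them into the weight matrix of the complete graph,
# i.e. w[i, j] = dist(p_i, p_j).
w = squareform(pdist(points, metric='euclidean'))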

Tom Swifty

Here's a working (brute-force) implementation for small n, which if nothing else shows the expressiveness of generator comprehensions:

selection = max(
    itertools.combinations(points, k),
    key=lambda candidate: sum(
        dist(p, q) for p, q in itertools.combinations(candidate, 2)
    )
)

Although this ends up calling dist a lot:

(k! / (2! (k-2)!)) * (n! / (k! (n-k)!)) == n! / (2 (k-2)! (n-k)!)
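For completeness, a self-contained usage sketch (the `points` list and the Euclidean `dist` below are just example choices, not prescribed by the answer):

import itertools
import numpy as np

points = [np.array(p, dtype=float) for p in [(0, 0), (1, 0), (0, 1), (2, 2), (3, 1)]]
k = 3

def dist(p, q):
    # Euclidean distance between two points.
    return np.linalg.norm(p - q)

# Brute force: examine every k-subset and keep the one with the largest
# sum of pairwise distances.
selection = max(
    itertools.combinations(points, k),
    key=lambda candidate: sum(
        dist(p, q) for p, q in itertools.combinations(candidate, 2)
    )
)
print(selection)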
Eric