
I don't know if this is a well-known problem or is reducible to one, but here is what I have been struggling with for the last few days.

Let X_1, X_2, X_3 .. X_n be sets of integers.

In each set X_i, we have m elements (the same m for all sets), x_{i,j} : j = 1...m.

My goal is to pick a single element from each set, such that the sum of pairwise differences between all the selected elements is least possible.

I started off with a slightly different problem: minimizing the sum of serial differences (that is, the sum of differences between elements selected from consecutive sets). The forward computation in the DP is as follows:

Let F(i, j) be the minimum cost of a selection from sets 1 through i whose last pick is element j of set i.

F(i, j) = min_k [ F(i-1, k) + |x_{i-1,k} - x_{i,j}| ]

That is, we consider selecting the k-th element from the previous set and then evaluate the cost of selecting the j-th element from the current set. The cost of the current step equals the absolute difference between the k-th element of set i-1 and the j-th element of set i.
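To make the recurrence concrete, here is a rough Python sketch (function and variable names are mine, not from the question):

```python
def serial_dp(sets):
    """Minimize the sum of |differences| between elements chosen from
    consecutive sets; returns (cost, chain) for one optimal chain."""
    n = len(sets)
    # F[j]: best cost of a chain through sets 0..i ending at element j of set i
    F = [0] * len(sets[0])
    parent = [[None] * len(s) for s in sets]   # back-pointers for recovery
    for i in range(1, n):
        new_F = []
        for j, xj in enumerate(sets[i]):
            # Pick the k minimizing F(i-1, k) + |x_{i-1,k} - x_{i,j}|
            k = min(range(len(sets[i - 1])),
                    key=lambda k: F[k] + abs(sets[i - 1][k] - xj))
            new_F.append(F[k] + abs(sets[i - 1][k] - xj))
            parent[i][j] = k
        F = new_F
    # Walk the back-pointers from the cheapest final element.
    j = min(range(len(F)), key=F.__getitem__)
    cost, chain = F[j], [sets[-1][j]]
    for i in range(n - 1, 0, -1):
        j = parent[i][j]
        chain.append(sets[i - 1][j])
    chain.reverse()
    return cost, chain
```

For example, on [[11, 24, 38], [12, 21, 33], [13, 22], [15, 23]] it yields cost 4 via the chain [11, 12, 13, 15].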

However, my real problem does not depend on the order of the sets X_1, ..., X_n. The DP does give me something, but, as I said, it solves a related yet different problem: finding the "chain" of elements, one from each set, that minimizes the sum of "serial" differences.

From a basic analysis, a naive approach would be to generate all permutations of the n sets and solve the DP for each permutation. This is intractable in general, but when n is small enough we can even take this dumb exhaustive approach.

I am unable to figure out whether this problem can be solved by a polynomial-time algorithm at all, or whether it can be reduced to one of the known NP-hard/complete problems, in which case I will go for an approximation algorithm by modeling it as a quadratic program.

Please advise me or point me to some reading material.

Thanks.

[Edit 1]

Adding an example here for the convenience of discussion:

X = [[11, 24, 38], [12, 21, 33], [13, 22], [15, 23]]

The solution is 24, 21, 22, 23 (possibly irrelevant, but my DP gives me 11, 12, 13, 15; I constructed this example specifically to make my DP fail).
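At this size the optimum can be confirmed by brute force, checking all m^n ways of picking one element per set (a sketch, independent of the DP):

```python
from itertools import combinations, product

def pairwise_cost(selection):
    # Sum of |a - b| over all unordered pairs in the selection.
    return sum(abs(a - b) for a, b in combinations(selection, 2))

def brute_force(sets):
    # Try every way of picking one element per set: m^n candidates.
    return min(product(*sets), key=pairwise_cost)

X = [[11, 24, 38], [12, 21, 33], [13, 22], [15, 23]]
```

Here brute_force(X) returns (24, 21, 22, 23), with pairwise cost 10, matching the solution stated above.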

[Edit 2]

I guess this is not an NP-complete problem. The solution we can build on top of the DP is as follows [not sure if it is correct, but it seems to be]:

The solution contains at least one element from each set.

So, let us select any set (preferably the smallest, if the sizes vary) and remove it from the collection.

Call the removed set Xp, so the remaining collection is X \ Xp.

Now, for each element x in Xp, construct a new collection X' = (X \ Xp) ∪ {{x}} (i.e., the remaining sets plus the singleton set {x}).

We solve the DP m times (once per choice of x), compute the objective function (the sum of pairwise distances) for each resulting selection, and pick the best of them.

The DP itself takes O(nm^2) time, and we run it m times, so the whole procedure is O(nm^3).
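For concreteness, a sketch of this procedure (my reading of it: pin each element x of the removed set, run the serial DP with {x} placed first, and score every resulting chain by the pairwise objective; whether this is optimal is exactly what is unproven):

```python
from itertools import combinations

def pairwise_cost(selection):
    return sum(abs(a - b) for a, b in combinations(selection, 2))

def serial_best_chain(sets):
    # Serial-difference DP; each state carries (cost, chain) so the
    # optimal chain can be read off without separate back-pointers.
    states = [(0, (x,)) for x in sets[0]]
    for s in sets[1:]:
        states = [min((c + abs(chain[-1] - x), chain + (x,))
                      for c, chain in states)
                  for x in s]
    return min(states)[1]

def pin_and_solve(sets):
    # Remove the first set (playing the role of Xp), pin each of its
    # elements in turn, and keep the chain with the smallest pairwise cost.
    pinned, rest = sets[0], sets[1:]
    candidates = (serial_best_chain([[x]] + rest) for x in pinned)
    return min(candidates, key=pairwise_cost)
```

On the example from Edit 1 this returns (24, 21, 22, 23).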

user1669710
    Nice question. (+1) Smells NP-Hardish, but I cannot think of a reduction at the moment. Hope someone will come up with one (or alternatively a polynomial solution) – amit Jul 13 '13 at 07:32
thanks; unfortunately it looks like it is NP-complete! Fortunately, SO is there to let you know of that! – user1669710 Jul 13 '13 at 08:30
  • I think this is not NP-Hard and could be solved in at most O(n*n*m) time: just take one smallest element from each set and insert to some container, then get a median of this container and continuously substitute element of some set to the next one so that its distance to median decreases most (or increases least) comparing to other sets; after each iteration compare the resulting sum of distances with the best-so-far result and update it if necessary. I've no proof of optimality. Also I suspect that with proper data structure this may be improved to O(n*log(n)*m). – Evgeny Kluev Jul 13 '13 at 09:26
  • @EvgenyKluev Applying this to my example, I pick [11, 12, 13, 15]. Median being 12.5; I did not understand from this point - because this is the set which has least distance to the median, and replacing with anything else will only increase the objective. Please let me know if I got you correctly. – user1669710 Jul 13 '13 at 18:58
  • First you replace 12 with the next element of the same set (21), which gives sum of distances to median 12: [11,13,15,21]. Then you replace other values: [11,15,21,22], then [15,21,22,24], then **[21,22,23,24]**, and finally [22,23,24,33]. One of these selections gives optimal result. As for the proof of optimality, I think proof by induction should work here, but it would be pretty complicated. – Evgeny Kluev Jul 13 '13 at 19:28
  • still working on checking this; thanks for the response. – user1669710 Jul 13 '13 at 23:10

2 Answers


This can be reduced to finding a minimum-weight clique in a weighted graph: create one node per element, and connect any two nodes from different sets by an edge whose weight equals the absolute value of their difference.
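If I read this correctly, the graph might be built like so (a sketch of the construction only; solving the minimum-weight clique itself is not attempted here):

```python
def build_clique_graph(sets):
    # One node per element, labelled by its set index.  Edges exist only
    # between elements of different sets, weighted by absolute difference;
    # "infinity" forbids edges within a set.  A minimum-weight clique of
    # size n then selects exactly one element from each set.
    nodes = [(i, x) for i, s in enumerate(sets) for x in s]
    inf = float("inf")
    w = [[inf] * len(nodes) for _ in nodes]
    for a, (ia, xa) in enumerate(nodes):
        for b, (ib, xb) in enumerate(nodes):
            if ia != ib:
                w[a][b] = abs(xa - xb)
    return nodes, w
```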

xjchengo
  • Thanks. Seems like it. Please see if I understand it correctly: Given 3 sets X1, X2, X3 with m elements each, we build a 3m x 3m adj. matrix, where the block diagonals of m x m (corresponding to sets) are infinities, and between any two elements selected from different sets, the weight is equal to the abs. difference between the elements? In other words, make all possible cliques by selecting one element from each set? – user1669710 Jul 13 '13 at 08:28
  • 1
    To show a problem is NP-Complete, you need to show a reduction from an NP-Hard problem to your problem, not the other way around. Remember that every problem in NP has a reduction **to** SAT. That does not mean every problem in NP is NP-Hard. – amit Jul 13 '13 at 08:33
@user1669710 Yes, my thinking is the same as yours. – xjchengo Jul 13 '13 at 08:46
  • @xjchengo +1 for the insight you had towards the connection; I am still breaking my head on reducing max-clique to this. – user1669710 Jul 13 '13 at 18:59
Turns out I can do it in O(nm^3) – I posted my solution above as an edit. Not sure if it is correct, though; I just put it out there. – user1669710 Jul 15 '13 at 22:15

I don't think this is NP-complete, because I think we can find and recognize a solution with at most O(n^2m^2) guesses.

So as not to worry about ties, go from integers to reals and add a very small amount of jitter. I think of this as random, but I think you could choose deterministic jitter to achieve the same effect.

Think of the problem as that of choosing from a collection of coloured points laid out on the real line so as to pick one point of each colour and minimise the sum of absolute differences between them.

I consider only the case when n is even. The median of the set of solution points occurs between the two central points. For each point in the solution set, there is no point of the same colour between it and the two centre points (or we could improve the solution). For the same reason, each point is the closest point of that colour to the median of the n-1 points we get if we remove it from the solution set.

Using our (nm)^2 guesses we guess the identity of the two central points in the solution set. This leaves us n-2 points to find, and we can reduce the possibilities to two of each colour, the closest point to the two central points on each side of those two points.

Now consider the median formed by removing one point from the solution set. If the point we remove is on the right side of the two central points, the median is the left of those two points. If the point we remove is on the left side, the median is the right of those two points. In a solution set, the point we have just removed is closer to the new median than any other point of that colour, and that new median is the further of the two central points from it. So we can distinguish it from the other candidate of the same colour - it is whichever of the two is closest to the further of the two central points.

Therefore, by making at most O(n^2m^2) guesses, we can find a possible solution set for each guess, and from these solution sets we choose the one with the smallest objective to get the global minimum. Each guess takes some work – maybe O(m) – so this may well be O(n^2m^3) with a completely naive implementation; perhaps it is mostly a theoretical approach.

Perhaps this is too good to be true, but can we turn this into a proof that it suffices to check each point in the data together with the point of each other colour closest to it? The argument would be that if we can recognize each point in the solution as the closest of its colour to the further of the two central points, then it must also be closest to the other point of the pair. So guessing either point in the pair and finding the closest points to it might work. This begins to look a lot like a proof of Evgeny Kluev's solution.
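The single-guess variant in the last paragraph could be sketched as follows (my own reading, with no optimality claim): for each point p in the data, take the point of each colour closest to p as a candidate selection, and keep the cheapest one.

```python
from itertools import combinations

def pairwise_cost(selection):
    return sum(abs(a - b) for a, b in combinations(selection, 2))

def closest_point_scan(sets):
    # Guess each point p; build the selection that takes the element of
    # each set (colour) closest to p; keep the selection with the smallest
    # sum of pairwise distances.  O(n*m) guesses, polynomial work each.
    best = None
    for s in sets:
        for p in s:
            sel = tuple(min(t, key=lambda x: abs(x - p)) for t in sets)
            if best is None or pairwise_cost(sel) < pairwise_cost(best):
                best = sel
    return best
```

On the question's example it recovers (24, 21, 22, 23).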

mcdowella
Thanks. Checking it out now. I agree with the O(n^2m^2) part. I am working on understanding your solution, but I am happy because I have another solution which also seems to be O(n^2m^2). I will add it as an edit. – user1669710 Jul 15 '13 at 20:05
Mine turned out to be O(nm^3), but maybe I am wrong. Still working through your solution. – user1669710 Jul 15 '13 at 22:16
  • I haven't really considered the order except to show it is polynomial. I have counted guesses and each guess takes perhaps O(m) work but I haven't considered more efficient ways of implementing it. Perhaps it is too good to be true but I have edited my answer to suggest that you might just need to guess one point, in which case it looks a lot like Evgeny Kluev's solution - try each point together with one point of each other colour closest to it as a possible solution. – mcdowella Jul 16 '13 at 04:37