Minimum set difference

Question

i came across this question on this website called codility, but i cant really figure out how to solve it, would appreciate the help

Given an array A of n integers, and the sequence S of n elements 1 or -1 we define the value:

enter image description here

Assume the sum of zero elements is equal zero. Write a function

int min_abs_sum(int[] A);

than given an array A of n integers from the range [-100..100] computes the lowest possible value of val(A,S) (for any sequence S with elements 1 or -1). You can assume that n<=20000 .

For example given array: a={1,5,2,-2}

your function should return 0, since for sequence S=(-1,1,-1,1) the val(A,S)=0.

Here are two links for some peoples result, it doesnt show the solution but it does show the complexity of their algorithms, the first link shows the complexity at which the program should run and the second one is slower.

1st link 100% marks

2nd link 86% marks

Can you explain more detailed what is val(X, Y) ? If I understand correctly, val(X,Y) tries to find the linear combination of X and an Y made of 1 and -1 that gives the closest value to 0... — Victor Hurdugaci, Apr 19 '11 at 14:25
i have edited the question i think it should be more clear now — yahh, Apr 19 '11 at 14:27
Can the function return negative values? Because if S = `{-1,-1,-1,1}` the return value would be `-10` which is less than `0`.... Ah, but the ABS, never mind, the edit clears that up — joe_coolish, Apr 19 '11 at 14:30
Codility specifying that n <= 20000 might be an important detail; Some algorithms might not scale to arbitrarily large values of n, but may work for a smaller array of inputs. — Luke Hutteman, Apr 19 '11 at 14:58

phkahler · Accepted Answer · 2011-04-19T19:41:43.503

9

This is poorly worded version of the partition problem. You are going to split the array A into 2 groups as close to equal as possible. The one with the larger sum you'll be assigning +1 in the S array, and the other group will get -1. Pick a solution to the partition problem and adjust it to return an answer to this one. Actually it's the variant of partition that seeks the best possible value as opposed to 2 equal sets.

EDIT here is some python code based on the paper linked by @Jerry Coffin

def min_abs_sum(A):
vals = []
for x in A:
    for v in vals:
        n = v+x
        if (abs(n)<=1000000) and (n not in vals): vals.append(n)
        n = v-x
        if (abs(n)<=1000000) and (n not in vals): vals.append(n)
    if (x not in vals): vals.append(x)
    if (-x not in vals): vals.append(-x)
return (min([abs(x) for x in vals]))

The one million value is half of 20000 (max numbers in A) times 100/2. I've used a list instead of an array, which means some things will be faster and some slower than what they do in the paper. It is conceivable that the min is achieved by summing the first half of the numbers and subtracting the second half - or something like that which requires large intermediate sums. I'm using a list rather than an array, but the size is still bounded. Sorry, I don't do Java.

edited Apr 19 '11 at 19:41

answered Apr 19 '11 at 14:51

phkahler

5,687
1
23
31

Ah, thank you. I was carefully reducing the Knapsack Problem to this, but the Partition Problem is more applicable. – David Thornley Apr 19 '11 at 14:55
+1, good call. I was thinking knapsack too, but partition is a better way of thinking about the problem – joe_coolish Apr 19 '11 at 15:06
@phkahler could you provide a programmatic solution to the problem – yahh Apr 19 '11 at 15:23
The first sentence is a borderline -1. Why is it a "poorly worded" version of that partition problem? Also this answer is ignoring the crucial fact that the numbers are in limited range, which gives a nice (linear time I think) dynamic programming algorithm. See Jerry's answer. – Apr 19 '11 at 18:31
@Moron: Because at the time I wrote my reply, there were a bunch of people commenting on it and none had identified it as such - see 2 responses to this answer alone that were thinking knapsack. It was also unclear that an algorithm had to find S as well as the resulting val(A,S) i.e. that S is not given. If I had this as an interview question, I would respond with "are all your requirements this bad?" but that's just my opinion of course. – phkahler Apr 19 '11 at 18:49
It is clear that you don't have to find S, but the minimum possible value of val(S,A). The interview-question tag was given by OP. The fact that this appears on codility should make it clear that this is likely a programming contest type of problem (and not a face to face interview question). If the range was not limited, then this is NP-Hard, by reducing to partition, but the solution does not end there (because of the limited range), but your answer does. Even if it was a face to face question, the attitude surely won't get you the job :-) Anyway... – Apr 19 '11 at 18:52
@phkahler no need to track sums up to 1000000, 200 is enough. There is no reason to increase sum beyond this value. See my answer at bottom. – blaze Apr 21 '11 at 16:02

Jerry Coffin · Answer 2 · 2011-04-19T18:29:15.933

This basically works out to partitioning a into two pieces with the sums of the absolute values of the two pieces as close to equal as possible.

You then want to multiply those elements by 1 or -1 to make one partition all negative and the other partition all positive. As you do that, you sum them to get the final answer.

From an algorithmic viewpoint, I believe the partitioning step is almost certainly NP-completely (phrases like "subset sum" and "partition problem" come to mind). From a programming viewpoint, it's pretty simple though -- exhaustively test possibilities until you get the best one. As long as the number of element is small (up to a dozen or so [edit: since it's O(2^N, you could probably increase that to somewhere in the 30-40 range) it'll be reasonably fast.

~~I believe it should be proportional to O(N!) though, so if the array gets at all large, the time taken will quickly become unreasonable.~~Since you're only dividing into two sets and order within sets doesn't matter, it's O(2^N) instead of O(N!). This doesn't grow nearly as quickly as O(N!), but still quickly enough to make large sets unreasonable to process.

I should add, however, that Codility seems to specialize in problems that may initially appear to be NP-complete, but really aren't -- if you've missed any detail in your description, the problem may be substantially easier.

Edit: rereading it, the problem may be that I ignored a crucial detail: the restricted range. I'm not sure how you use it offhand, but I'm pretty sure it's crucial to producing an efficient solution. My immediate guess is that it's based on something similar to changing a comparison-based sort to a counting (aka bucket) sort. I haven't thought through it in any real detail though...

Edit2: Doing a bit of looking (and being prompted by @Moron), the limited range is important, and my thinking about how it figures into a solution was generally correct. @Moron was kind enough to point to the Wikipedia entry for the subset sum problem, but I didn't find that particularly well written. A bit of looking turned up a paper from Cornell with an explanation I found a bit cleaner/more understandable.

i havent missed any part of the question, if u click on the links provided you can read the question on their website — yahh, Apr 19 '11 at 14:55
It is Theta(2^n), not Theta(n!). Horowitz Sahni's algorithm allows O(n2^(n/2)). See this previous question: http://stackoverflow.com/questions/3121000/generate-all-subset-sums-within-a-range-faster-than-okn-2n-2. And yes, the restricted ranges makes this easier (see the link in my answer). If you add that to your answer, I can delete my answer. — , Apr 19 '11 at 17:41

Pablo · Answer 3 · 2013-06-26T20:06:15.580

The following Java solution will score 100% in Codility. My solution is based on the 'Partition with Duplicate Elements' section of the paper linked by @Jerry Coffin but I also incorporated several additional optimizations.

import java.util.Arrays;

class Solution {

    public int solution ( int[] A ) {

        int n=A.length,r=0,c=1,sum=0,mid=0;

        // Add all numbers, replace them with their absolute value, and sort them
        for(int i=0;i<n;i++) {
          A[i]=Math.abs(A[i]);
          sum+=A[i];
        }
        Arrays.sort(A); // This minimizes the speed of growth of r in the loop below and allows us to count duplicates while scanning the array
        mid=sum/2; // If the number is odd the result is rounded down (the best possible solution is 1 instead of 0).

        // Find the subset of numbers whose sum is closest to half the total sum    
        boolean[] bs=new boolean[mid+101]; // We only need to check sets that are equal or less than half the sum of the numbers (to avoid having to check the sum in the inner loop I sacrifice 100 booleans since 100 is the maximum value allowed)
        bs[0]=true; // The set with zero elements always adds to 0
        for(int i=0;i<n;i++){
          if( A[i]==0 ) continue;
          // Count duplicate entries
          if(i<n-1 && A[i]==A[i+1] ){
            c++;
            continue;
          } 
          // Scan the subset sum result array from right to left and add A[i] c times to existing subset sums
          for (int j = r; j >= 0; j--)
            if(bs[j] ){
              int m= Math.min(mid, j+A[i]*c );
              for(int k= j+A[i];k<=m && !bs[k];k+=A[i] ) bs[k]=true; // To avoid duplicate work when dealing with multiples of previous numbers the loop stops if we find an entry has already been set.
            }
          r=Math.min(mid, r+A[i]*c); // New rightmost subset sum can be no more than the mid point
          while(!bs[r]) r--; // Scan back to rightmost subset sum
          if(r==mid) break; // Found an optimal solution; no need to continue
          c=1;
    }
    return sum-2*r; // The rightmost subset sum that does not go over half the sum is the best solution, compute the difference of the complementary subsets (r and sum-r).
  }

}

joe_coolish · Answer 4 · 2011-04-19T15:08:12.813

So, the goal would be to get as close to 0 as possible.

My first thought was that I would sort the array in descending order, then iterate over the list doing something like this:

int total = 0;
foreach(int i in a)
{
    total = Math.Min(Math.Abs(total - i), Math.Abs(total + i));
}

which would work for a={1,5,2,-2} (total will be the following 5,4,2,0)

But I'm not sure that it works in all cases. I'll look into it a bit more then see if there is a case that it doesn't work for.

EDIT:

Ok, I guess the brute force will work?

    public static int MinArray(int[] array)
    {
        int val = int.MaxValue;

        for (int i = 0; i < Math.Pow(2, array.Length); i++)
        {
            val = Math.Min(CalcMin(array, i), val);
        }

        return val;
    }

    private static int CalcMin(int[] array, int negs)
    {
        int ret = 0;

        for (int i = 0; i < array.Length; i++)
        {
            int neg = 1;

            if (negs != 0)
            {
                neg = negs % 2 == 1 ? -1 : 1;
                negs >>= 1;
            }
            ret += array[i] * neg;
        }

        return Math.Abs(ret);
    }

So, what I'm doing is taking every iteration of S (which is calculated by taking the binary of i in the MinArray) and finding the min that way.

With a little bit of modification, you could also get the correct value for S (if that is a requirement. If not, making it a requirement might score you some points on the interview?)

i think there should be a more sophisticated solution than brute force, if you check the links it shows the complexity at which it should run — yahh, Apr 19 '11 at 15:10
Ah, I didn't see the update with the links. I'll take a look — joe_coolish, Apr 19 '11 at 15:13

blaze · Answer 5 · 2011-05-18T14:31:02.077

0

This could possibly work fast:

maxvalue = 100

def solve(data):
    def mark_sum(s):
        # wrap sum around maxvalue
        if s >= maxvalue:
            s -= maxvalue * 2
        elif sum < -maxvalue:
            s += maxvalue * 2
        # mark sum
        if s >= 0:
            s_new_pos[s] = True
        else:
            s_new_neg[s + maxvalue] = True

    s_old_pos = [False] * maxvalue  # marks for sums [0..99]
    s_old_neg = [False] * maxvalue  # marks for sums [-100..-1]
    s_old_pos[0] = True  # seed array with zero sum for zero elements
    for n in data:
        s_new_pos = [False] * maxvalue
        s_new_neg = [False] * maxvalue
        for i in range(maxvalue):   # mark new possible sums
            if s_old_pos[i]:
                mark_sum(i + n)
                mark_sum(i - n)
            if s_old_neg[i]:
                mark_sum(i - 100 + n)
                mark_sum(i - 100 - n)
        s_old_pos = s_new_pos
        s_old_neg = s_new_neg
    for i in range(maxvalue):
        if s_old_pos[i]:
            return i
        if s_old_neg[-1 - i]:
            return abs(-1 - i)
    raise AssertionError('my bad')

No need to check for all possible sums (up to 1000000). They can be just wrapped around max_value. This replaces n with max_value in time complexity.

Still unsure about correctness :(

edited May 18 '11 at 14:31

answered Apr 21 '11 at 16:00

blaze

4,326
18
23

This is wrong. >>> solve([89,90,91,54,54,54,54,54]) 34 but the answer is 0, add first 3, subtract last 5. I think the general approach is right (make alg fast by restricting max sum), but I am not sure of an appriopriate bound on max sum. Also, I suspect that if you randomize the input order and run your algorithm a few times (perhaps with maxsum == 2000 instead of 200), it will likely be correct. I have a hunch that max sum = 100 * 100 will always work, but I can't prove it. – Rob Neuhaus May 18 '11 at 07:27
maxsum = n * max_value will definitely work, but give n*(n*max_value) complexity for 86% points. Can't find anything for n * max_value * max_value for 100% :( – blaze May 18 '11 at 13:46
@rrenaud: i've updated my answer, but still unsure about it's correctness – blaze May 18 '11 at 14:31
Is there a way for me to submit an answer? Really, I think you should use your first algorithm, except change max_num to 10,000 and do a random shuffle of data before applying the algorithm. Then the chance you get a wrong answer is related to the expected translation of a random walk in 1D. http://en.wikipedia.org/wiki/Random_walk. Play with different maxsum here, and you'll see that usually 2000 works. [100]*99 + [99]*100 is the hardest input I can come up with for the bounded max sum style alg. http://codepad.org/DPGyc3rI – Rob Neuhaus May 18 '11 at 20:36

zhizhong · Answer 6 · 2011-05-24T17:07:01.497

def min_abs_sum(A):
    A[:] = sorted([ abs(i) for i in A if i != 0 ], reverse=True)
    s = sum(A)
    h = s / 2
    r = find_balance_iter(h, A)
    return abs(2*(h-r) - s)

def find_balance_iter(v, A):
    r = v
    n = len(A)
    for i in xrange(n):
        if i and A[i] == A[i-1]:
            continue
        for j in xrange(n-i-1):
            vv = v - A[i]
            rr = vv
            AA = A[i+j+1:]
            while True:
                if vv == 0 or vv in AA:
                    return 0
                if vv < 0 or not AA:
                    rr = vv
                    break
                if vv < AA[-1]:
                    rr = min(vv-AA[-1], rr, key=compare)
                    break
                vv -= AA[0]
                AA[:] = AA[1:]
            r = min(r, rr, key=compare)
    return r

def compare(a):
    return (abs(a), a)

This doesn't work. You say min_abs_sum([8,10,6,6,6]) = 4, but really it's 0. — Rob Neuhaus, May 18 '11 at 07:18

Minimum set difference

6 Answers6

Linked