2

Need to split a list of positive values into a K groups. The groups should be almost equal in size (number of values), but the values of items in each group have the minimal standard deviation. (Think grouping products by similar pricing).

I was thinking sort the values, then split by order into K groups gets me very close and then only if the groups sizes aren't all equals do one-pass to see if the extra items moved from one group to the next creates a better result. But I feel like I'm missing something big here and it should be more complicated than that.

dean734
  • 21
  • 2
  • 1
    a.) what have you tried so far? can you show any code? b.) you tagged both java and javascript. which one is it? – NullDev Mar 14 '23 at 14:34
  • 1
    llook up https://en.wikipedia.org/wiki/K-means_clustering ;-) – MBobrik Mar 14 '23 at 14:35
  • I removed JS code though both would work for us. I didn't write any code yet as I feel I'm missing something very basic and it's not as simple as I make it. Will write my solution shortly if it helps. – dean734 Mar 14 '23 at 14:36
  • @MBobrik Yes, I've seen that which is why I feel I'm missing something basic here with my solution which is far from np-hard... – dean734 Mar 14 '23 at 14:40
  • 1
    quote "The problem is computationally difficult (NP-hard); however, *efficient heuristic algorithms converge quickly to a local optimum*." ... so you basically got one of those heuristic solutions. – MBobrik Mar 14 '23 at 14:43
  • @MBobrik No, in one dimension this is NOT NP-hard. In 2 dimensions it is. – btilly Mar 14 '23 at 18:07
  • @btilly to be honest, never thought about the 1D special case in detail - but never mind, others did - just found this one https://stackoverflow.com/questions/11513484/1d-number-array-clustering – MBobrik Mar 14 '23 at 21:12

2 Answers2

1

You are missing a complication.

Suppose that k is even and n = 2.5*k. Then half your groups are of size 2 and half of size 3. With a naïve approach, you have to consider the k choose 2 = O(2^k / sqrt(k)) possible ways to decide which are of size 2 and which are of size 3. This will take exponential time.

Let m be the group size, rounded down. Then the optimal division of the first i items into j groups will be to have either m elements in the last group and divide the first i-m items into j-1 groups, or to have m+1 elements in the last group and divide the first i-m-1 items into j-1 groups. You can calculate that with a recursive function which returns a pair for (cost, linkedListOfChoices).

Add a caching layer to that recursion and it will run in polynomial time. This is called top-down dynamic programming.

Going from your optimal linkedListOfChoices for dividing the whole list into k groups to the actual groups should be straightforward. Here is an example.

Suppose we want to divide [0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 1.0, 2.1, 2.3] into 4 groups. You should get a linked list corresponding to (but maybe with a very different representation):

[8, 9,
    [5, 7,
        [2, 4,
            [0, 1,
                null]]]]]

From which your groups (read inside out, turn indices back into numbers) are:

[[0.1, 0.2], [0.4, 0.5, 0.6], [0.8, 0.9, 1.0], [2.1, 2.3]]
btilly
  • 43,296
  • 3
  • 59
  • 88
  • Sorry for being slow, I think I got most of what you wrote and it makes sense but now I'm not sure is the success criteria / minimal "price" for the overall thing that I measure? the optimized solution? is it summing up all the deviation values for each group and looking for the minimal sum? – dean734 Mar 15 '23 at 03:41
  • @dean734 Exactly. Minimum sum of deviations so far, and the trail of decisions that got you there. – btilly Mar 15 '23 at 05:47
0

The idea of sorting first is correct. Then you have to create K groups from the sorted array, which you can do as following: If there is any deviation you can play around the "border elements" of each group to make the result better.

For example: N=16, K=3. Now there are some possible outcomes. First group 5 memebers, second group 6 members and last group again 5 members, but there are other possibilties to create your groups. Now you need to find and evaluate all the possibilities, and pick out the best one. For example, if shape [5,5,6] is better than shape [5,6,5] you move the last element ("border element") from the second group into the third group.

If there is no deviation at all, for example N=100, K=10, there is only one possibility you can choose, and that is the best one in this case.

Animalz
  • 90
  • 8