
First off, I'm not even sure the terminology is right, as I haven't found anything similar (especially since I don't even know what keywords to use).

The problem: There is a population of people, and I want to assign them to groups. I have a set of rules that gives each assignation a score. I want to find the best assignation (or at least a very good one).

For example, with a population of four {A,B,C,D} and assigning to two groups of two, the possible assignations are:

{A,B},{C,D}

{A,C},{B,D}

{A,D},{B,C}

And for example, {B,A},{C,D} and {C,D},{A,B} are both the same as the first one (I don't care about the order inside the groups and the order of the groups themselves).

The number of people, the number of groups, and how many people fit in each group are all inputs.

My idea was to list each possible assignation, calculate its score, and keep track of the best one. That is, to brute-force it. As the population can be big, I was thinking of going through them in a random order and returning the best one found when time runs out (probably when the user gets bored or thinks it is a good enough find). The population can vary from very small (the four listed) to really big (maybe 200+), so just trying random ones without caring about repeats breaks down for the small ones, where a full brute force is possible (plus I wouldn't know when to stop if I used plain random permutations).

The population is big enough that listing all the assignations so I can shuffle them doesn't fit into memory. So I need either a method to find all the possible assignations in a random order, or a method to, given an index, generate the corresponding assignation, and use an index array and shuffle that (the second would be better because I can then easily distribute the work across multiple servers).

– Daniferrito
  • The shuffle-approach only needs one shuffle per iteration. Your problem is not well formulated, but what about the following: for each iteration, shuffle your population and assign them linearly, so the first two are group 0, the next two are group 1, and so on. This only needs O(population-size) space per iteration. The chance of obtaining the same shuffle (wasting an iteration) is very low. If group sizes are non-fixed, nothing needs to change: just put the first x elements into group 0, the next y elements into group 1. An efficient algorithm, though, will always use information about the costs – sascha Aug 30 '16 at 23:32
  • After re-reading your question: the above approach does not care about the order-properties described (```{B,A},{C,D} and {C,D},{A,B}``` can both be sampled), but I suspect that the overall efficiency of this approach is still good (because the probability of these events is low). The approach is also very simple. But every efficient algorithm depends on your parameters and the cost-function, so it might be wise to improve your question! – sascha Aug 30 '16 at 23:36
  • I haven't mentioned it in the question (I will add it now), but the population varies from very small (maybe even the four-people example) to quite big (for now maybe 200, which has more permutations than a long can store), so pure randomization doesn't work for the small ones, where brute force finishes quite quickly. – Daniferrito Aug 31 '16 at 00:04
  • Tell us something about the cost-function. It's kind of dumb to search blindly (black-box optimization is very hard). – sascha Aug 31 '16 at 00:17
  • Is this question about how to generate sets, or is it about the larger question of how to solve your optimization problem? – samgak Aug 31 '16 at 00:39
  • I'm not sure it is right to add that (it would be a second question), but each person has four values (something like height, weight, ...), and I want the average of each value within each group to be as close as possible to the average of the whole population, with some values more important than others. Basically, I want all groups to be average in all categories and none to stand out over the others. – Daniferrito Aug 31 '16 at 00:44
  • @samgak you are right, I should split this question into two: one about how to generate sets and another about how to approach the optimization problem – Daniferrito Aug 31 '16 at 00:46
  • With **your cost-function description**: this could be formulated as a **mixed-integer second-order-cone problem** (there are not many solvers, but Mosek (commercial) and Pajarito (open-source) can tackle this; so can the much more general mixed-integer nonlinear solvers like Knitro/Couenne/Bonmin). A **well-formulated approach will easily beat the blind approach and tuned metaheuristics, and can provide proofs of bounds**. I'm also having problems with the framing of your question: why not ask for a general approach instead of a sub-step of a very naive one? – sascha Aug 31 '16 at 01:18
  • I forgot to add that the **MISOCP** approach would be the natural one when optimizing the squared error / L2-norm. L1-norm minimization should be possible to formulate as a classic **MIP**, where many more (and much faster) solvers are available, including the excellent open-source solver [cbc](https://projects.coin-or.org/Cbc). – sascha Aug 31 '16 at 01:24

6 Answers


A simple recursive algorithm for generating these pairings is to pair the first element with each of the remaining elements, and for each of those pairings, recursively generate all the pairings of the remaining elements. For larger groups, generate all the groups made up of the first element plus each combination of the remaining elements, then recurse on the rest.
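
For the pairs case, that recursion can be sketched as follows (an illustrative sketch, not part of the original answer; the general version would pick groupSize - 1 companions as an increasing combination instead of a single partner):

public static void pairings(String[] elements, boolean[] used, String prefix)
{
    // the first unused element always leads the next pair, so group order
    // and in-pair order never vary and each pairing is produced exactly once
    int first = 0;
    while(first < elements.length && used[first])
        first++;
    if(first == elements.length)
    {
        System.out.println(prefix);   // every element paired: one complete pairing
        return;
    }
    used[first] = true;
    for(int second = first + 1; second < elements.length; second++)
    {
        if(used[second])
            continue;
        used[second] = true;
        pairings(elements, used, prefix + "{" + elements[first] + "," + elements[second] + "}");
        used[second] = false;
    }
    used[first] = false;
}

Calling pairings(new String[]{"A","B","C","D"}, new boolean[4], "") prints exactly the three assignations from the question.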

You can compute how many possible sets of groups there are like this:

public static int numGroupingCombinations(int n, int groupSize)
{
    if(n % groupSize != 0)
        return 0;   // n must be a multiple of groupSize

    int count = 1;
    while(n > groupSize)
    {
        count *= nCr(n - 1, groupSize - 1);
        n -= groupSize;
    }
    return count;
}

public static int nCr(int n, int r)
{
    // binomial coefficient C(n, r); the division is exact at every step,
    // since ret*(n-k) is always a multiple of (k+1). Note that this
    // overflows int for large inputs, so switch to long (or BigInteger)
    // for big populations.
    int ret = 1;
    for (int k = 0; k < r; k++) {
        ret = ret * (n-k) / (k+1);
    }
    return ret; 
}

> So I need either a method to find all the possible assignations in a random order, or a method to, given an index, generate the corresponding assignation, and use an index array and shuffle that (the second would be better because I can then easily distribute the tasks into multiple servers).

To generate a grouping from an index, choose a combination of items to group with the first element by taking the modulo of the index with the number of possible combinations, and generating the combination from the result using this algorithm. Then divide the index by that same number and recursively generate the rest of the set.

public static void generateGrouping(String[] elements, int groupSize, int start, int index)
{
    if(elements.length % groupSize != 0)
        return;

    int remainingSize = elements.length - start;
    if(remainingSize == 0)
    {
        // output the elements:
        for(int i = 0; i < elements.length; i += groupSize)
        {
            System.out.print("[");
            for(int j = 0; j < groupSize; j++)
                System.out.print(((j==0)?"":",")+elements[i+j]);
            System.out.print("]");
        }
        System.out.println("");
        return; 
    }

    int combinations = nCr(remainingSize - 1, groupSize - 1);

    // decide which combination of remaining elements to pair the first element with:
    int[] combination = getKthCombination(remainingSize - 1, groupSize - 1, index % combinations);

    // swap elements into place
    for(int i = 0; i < groupSize - 1; i++)
    {
        String temp = elements[start + 1 + i];
        elements[start + 1 + i] = elements[start + 1 + combination[i]];
        elements[start + 1 + combination[i]] = temp;
    }

    generateGrouping(elements, groupSize, start + groupSize, index / combinations);

    // swap them back:
    for(int i = groupSize - 2; i >= 0; i--)
    {
        String temp = elements[start + 1 + i];
        elements[start + 1 + i] = elements[start + 1 + combination[i]];
        elements[start + 1 + combination[i]] = temp;
    }
}

public static void getKthCombination(int n, int r, int k, int[] c, int start, int offset)
{
    if(r == 0)
        return;
    if(r == n)
    {
        for(int i = 0; i < r; i++)
            c[start + i] = i + offset;
        return;
    }
    int count = nCr(n - 1, r - 1);
    if(k < count)
    {
        c[start] = offset;
        getKthCombination(n-1, r-1, k, c, start + 1, offset + 1);
        return;
    }
    getKthCombination(n-1, r, k-count, c, start, offset + 1);
}

public static int[] getKthCombination(int n, int r, int k)
{
    int[] c = new int[r];
    getKthCombination(n, r, k, c, 0, 0);

    return c;
}


The start parameter is just how far along the list you are, so pass zero when calling the function at the top level. The function could easily be rewritten to be iterative. You could also pass an array of indices instead of an array of objects that you want to group, if swapping the objects is a large overhead.
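
As a hypothetical driver (not part of the original answer, and assuming the methods above live in the same class), you can enumerate every grouping by walking all indices:

public static void main(String[] args)
{
    String[] elements = {"A", "B", "C", "D"};
    int groupSize = 2;
    int total = numGroupingCombinations(elements.length, groupSize);   // 3
    for(int index = 0; index < total; index++)
        generateGrouping(elements, groupSize, 0, index);   // one grouping per index
}

Because generateGrouping swaps the elements back before returning, the array is unchanged between calls, and each of the three distinct pairings is printed exactly once.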

– samgak
  • The number of people in each group can vary (they are not limited to groups of two). I assume this algorithm can be made to work with bigger groups. Is that right? If it is, this seems the perfect answer. – Daniferrito Aug 31 '16 at 00:14
  • Yes, it can. For example for groups of 3 you would group the first element with each possible combination of other 2 elements, modulo/divide the index by ((remainingSize - 1) * (remainingSize - 2))/2, and increment start by 3 when you recurse. – samgak Aug 31 '16 at 00:18
  • To write a general algorithm that handles arbitrary group sizes, you need to count and generate the nth combination of the remaining elements at each step. You can use the algorithm here: [nth combination](http://stackoverflow.com/questions/1776442/nth-combination) – samgak Aug 31 '16 at 00:50

Let's say we have a total of N elements that we want to organize in G groups of E (with G*E = N). Neither the order of the groups nor the order of the elements within groups matter. The end goal is to produce every solution in a random order, knowing that we cannot store every solution at once.

First, let's think about how to produce one solution. Since order doesn't matter, we can normalize any solution by sorting the elements within groups as well as the groups themselves, by their first element.

For instance, if we consider the population {A, B, C, D}, with N = 4, G = 2, E = 2, then the solution {B,D}, {C,A} can be normalized as {A,C}, {B,D}. The elements are sorted within each group (A before C), and the groups are sorted (A before B).
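
A minimal sketch of that normalization (my illustration, assuming elements compare by their natural order):

import java.util.*;

public static List<List<String>> normalize(List<List<String>> groups)
{
    List<List<String>> result = new ArrayList<>();
    for(List<String> group : groups)
    {
        List<String> sorted = new ArrayList<>(group);
        Collections.sort(sorted);   // sort the elements within each group
        result.add(sorted);
    }
    result.sort(Comparator.comparing(g -> g.get(0)));   // sort the groups by first element
    return result;
}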

When the solutions are normalized, the first element of the first group is always the first element of the population. The second element is one of the N-1 remaining, the third element is one of the N-2 remaining, and so on, except these elements must remain sorted. So there are (N-1)!/((N-E)!*(E-1)!) possibilities for the first group.

Similarly, the first elements of the subsequent groups are fixed: each is the first of the elements remaining after the previous groups have been formed. Thus, the number of possibilities for the (n+1)th group (n from 0 to G-1) is (N-nE-1)!/((N-(n+1)E)!*(E-1)!) = ((G-n)E-1)!/(((G-n-1)E)!*(E-1)!).

This gives us one possible way of indexing a solution. The index is not a single integer, but rather an array of G integers, where the nth integer (n still from 0 to G-1) lies in the range 1 to (N-nE-1)!/((N-nE-E)!*(E-1)!) and represents group n (the "(n+1)th group") of the solution. Such an index array is easy to produce randomly and easy to check for duplicates.

The last thing we need is a way to produce a group from its corresponding integer: we must choose E-1 elements from the N-nE-1 remaining ones. You can imagine listing every combination and picking the one at that position; of course, this can be done without generating every combination: see this question.

For curiosity, the total number of solutions is (GE)!/(G!*(E!)^G).
In your example, it is (2*2)!/(2!*(2!)^2) = 3.
For N = 200 and E = 2, there are 6.7e186 solutions.
For N = 200 and E = 5, there are 6.6e243 solutions (the maximum I found for 200 elements).

Additionally, for N = 200 and E > 13, the number of possibilities for the first group is greater than 2^64 (so it cannot be stored in a 64-bit integer), which is problematic for representing an index. But as long as you don't need groups with more than 13 elements, you can use arrays of 64-bit integers as indices.
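
Here is a sketch of this indexing scheme (my illustration, not the answer's code; it uses 0-based index values, and the modulo draw has a negligible bias for sketch purposes):

import java.util.*;

public class RandomSolutionIndex
{
    // C(n, r) as a long; the division is exact at every step
    static long binomial(int n, int r)
    {
        long ret = 1;
        for(int k = 0; k < r; k++)
            ret = ret * (n - k) / (k + 1);
        return ret;
    }

    // one random solution index: entry n picks the companions of group n's
    // fixed leader among C(N-nE-1, E-1) possibilities
    static long[] randomIndex(int G, int E, Random rng)
    {
        long[] index = new long[G];
        for(int n = 0; n < G; n++)
        {
            long range = binomial(G * E - n * E - 1, E - 1);
            index[n] = Math.floorMod(rng.nextLong(), range);
        }
        return index;
    }

    public static void main(String[] args)
    {
        Random rng = new Random();
        Set<List<Long>> seen = new HashSet<>();   // duplicate check
        while(seen.size() < 3)   // G = E = 2 has exactly 3 solutions
        {
            long[] index = randomIndex(2, 2, rng);
            List<Long> key = new ArrayList<>();
            for(long v : index)
                key.add(v);
            if(seen.add(key))
                System.out.println(Arrays.toString(index));
        }
    }
}

Decoding an entry of the index array into an actual group is exactly the kth-combination step mentioned above (samgak's answer contains one implementation).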

– Nelfeal

What you call "assignations" are partitions with a fixed number of equally sized parts. Well, mostly. You didn't specify what should happen if (# of groups) * (size of each group) is less than or greater than your population size.

Generating every possible partition in a non-specific order is not too difficult, but it is only good for small populations or for filtering and finding any partition that matches some independent criteria. If you need to optimize or minimize something, you'll end up looking at the whole set of partitions, which may not be feasible.

Based on the description of your actual problem, you want to read up on local search and optimization algorithms, of which the aforementioned simulated annealing is one such technique.


With all that said, here is a simple recursive Python function that generates fixed-length partitions with equal-sized parts in no particular order. It is a specialization of my answer to a similar partition problem, and that answer is itself a specialization of this answer. It should be fairly straightforward to translate into JavaScript (with ES6 generators).

def special_partitions(population, num_groups, group_size):
    """Yields all partitions with a fixed number of equally sized parts.

    Each yielded partition is a list of length `num_groups`, 
    and each part is a tuple of length `group_size`.
    """
    assert len(population) == num_groups * group_size
    groups = []  # a list of lists, currently empty 

    def assign(i):
        if i >= len(population): 
            yield list(map(tuple, groups))
        else:
            # try to assign to an existing group, if possible
            for group in groups:
                if len(group) < group_size:
                    group.append(population[i])
                    yield from assign(i + 1)
                    group.pop()

            # assign to an entirely new group, if possible
            if len(groups) < num_groups:
                groups.append([population[i]])
                yield from assign(i + 1)
                groups.pop()

    yield from assign(0)

for partition in special_partitions('ABCD', 2, 2):
    print(partition)

print()

for partition in special_partitions('ABCDEF', 2, 3):
    print(partition)

When executed, this prints:

[('A', 'B'), ('C', 'D')]
[('A', 'C'), ('B', 'D')]
[('A', 'D'), ('B', 'C')]

[('A', 'B', 'C'), ('D', 'E', 'F')]
[('A', 'B', 'D'), ('C', 'E', 'F')]
[('A', 'B', 'E'), ('C', 'D', 'F')]
[('A', 'B', 'F'), ('C', 'D', 'E')]
[('A', 'C', 'D'), ('B', 'E', 'F')]
[('A', 'C', 'E'), ('B', 'D', 'F')]
[('A', 'C', 'F'), ('B', 'D', 'E')]
[('A', 'D', 'E'), ('B', 'C', 'F')]
[('A', 'D', 'F'), ('B', 'C', 'E')]
[('A', 'E', 'F'), ('B', 'C', 'D')]

Perhaps a simulated annealing approach might work. You can start with a non-optimal initial solution and iterate using heuristics to improve.

Your scoring criteria may help you choose the initial solution, e.g. make the best scoring first group you can, then with what's left make the best scoring second group, and so on.

Good choices of "neighboring states" may be implied by your scoring criteria, but at the very least, you could consider two states neighboring if they differ by a single swap.

So the iteration portion of the algorithm would be to try a bunch of swaps, sampled randomly, and choose the one that improves the global score according to the annealing schedule.

I'm hoping you can find a better choice of the adjacent states! That is, I'm hoping you can find better rules for iteratively improving based on your scoring criteria.
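
A minimal sketch of that loop (my illustration; score is a placeholder for the real scoring rules, and the cooling schedule is the simplest possible one). The state is a flat array read in fixed-size chunks as groups, and a neighbor swaps two members of different groups:

import java.util.Random;

// placeholder for the real scoring rules (higher is better here)
static double score(int[] assignment, int groupSize)
{
    return 0.0;
}

static int[] anneal(int[] initial, int groupSize, int iterations)
{
    Random rng = new Random();
    int[] current = initial.clone(), best = initial.clone();
    double currentScore = score(current, groupSize), bestScore = currentScore;
    for(int it = 0; it < iterations; it++)
    {
        double temperature = 1.0 - (double) it / iterations;   // linear cooling
        int i = rng.nextInt(current.length), j = rng.nextInt(current.length);
        if(i / groupSize == j / groupSize)
            continue;   // same group: swapping changes nothing
        swap(current, i, j);
        double s = score(current, groupSize);
        // always accept improvements; accept worsenings with a temperature-scaled chance
        if(s >= currentScore || rng.nextDouble() < Math.exp((s - currentScore) / Math.max(temperature, 1e-9)))
        {
            currentScore = s;
            if(s > bestScore)
            {
                bestScore = s;
                best = current.clone();
            }
        }
        else
        {
            swap(current, i, j);   // rejected: undo the swap
        }
    }
    return best;
}

static void swap(int[] a, int i, int j)
{
    int t = a[i]; a[i] = a[j]; a[j] = t;
}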

– Codie CodeMonkey
  • I definitely would prefer your approach compared to the exhaustive generation and scoring. It's also possible to build some kind of hybrid: just build N different start positions and use SA with 1/N iterations (a kind of restart strategy used in many combinatorial algorithms; e.g. all kmeans implementations do this). – sascha Aug 31 '16 at 00:24

If you have a sufficiently large population that you can't fit all assignations in memory and are unlikely to ever test all possible assignations, then the simplest method is just to choose test assignations randomly. For example:

repeat 
    randomly shuffle population
    put 1st n/2 members of the shuffled pop into assig1 and 2nd n/2 into assig2
    score assignation and record it if best so far
until bored

If you have a large population, there is unlikely to be much loss of efficiency from duplicating a test, as the chance of hitting the same assignation twice is small.
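
In code, the loop might look like this (my sketch; scoreOf is a placeholder for the actual rules, a fixed iteration count stands in for "until bored", and the population size is assumed to be a multiple of groupSize):

import java.util.*;

static double scoreOf(List<List<String>> groups)
{
    return 0.0;   // placeholder: plug in the real scoring rules
}

static List<List<String>> randomSearch(List<String> population, int groupSize, int iterations)
{
    Random rng = new Random();
    List<String> pop = new ArrayList<>(population);
    List<List<String>> best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for(int it = 0; it < iterations; it++)
    {
        Collections.shuffle(pop, rng);
        // slice the shuffled population into consecutive groups
        List<List<String>> groups = new ArrayList<>();
        for(int i = 0; i < pop.size(); i += groupSize)
            groups.add(new ArrayList<>(pop.subList(i, i + groupSize)));
        double s = scoreOf(groups);
        if(s > bestScore)
        {
            bestScore = s;
            best = groups;
        }
    }
    return best;
}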

Depending on your scoring rules, it may be more efficient to choose the next assignation to test by, for example, swapping a pair of members in the best assignation found so far, but you haven't provided enough information to determine whether that is the case.

– Penguino
  • This answer does not cover anything not already mentioned in my comments (which were made earlier) and also treats only the special case of two assignment groups. – sascha Aug 30 '16 at 23:50
  • I was writing it as you commented. And it would be a simple variation to put the first n/k members into the first group, etc., to score a k-group assignment. – Penguino Aug 31 '16 at 01:16

Here is an approach targeting your optimization problem (and ignoring your permutation-based approach).

I formulate the problem as a mixed-integer problem and use specialized solvers to calculate good solutions.

As your problem is not well formulated, it might need some modifications. But the general message is: this approach will be hard to beat!

Code

import numpy as np
from cvxpy import *

""" Parameters """
N_POPULATION = 50
GROUPSIZES = [3, 6, 12, 12, 17]
assert sum(GROUPSIZES) == N_POPULATION
N_GROUPS = len(GROUPSIZES)
OBJ_FACTORS = [0.4, 0.1, 0.15, 0.35]  # age is the most important

""" Create fake data """
age_vector = np.clip(np.random.normal(loc=35.0, scale=10.0, size=N_POPULATION).astype(int), 0, np.inf)
height_vector = np.clip(np.random.normal(loc=180.0, scale=15.0, size=N_POPULATION).astype(int), 0, np.inf)
weight_vector = np.clip(np.random.normal(loc=85, scale=20, size=N_POPULATION).astype(int), 0, np.inf)
skill_vector = np.random.randint(0, 100, N_POPULATION)

""" Calculate a-priori stats """
age_mean, height_mean, weight_mean, skill_mean = np.mean(age_vector), np.mean(height_vector), \
                                                 np.mean(weight_vector), np.mean(skill_vector)


""" Build optimization-model """
# Variables
X = Bool(N_POPULATION, N_GROUPS)  # 1 if part of group
D = Variable(4, N_GROUPS)  # aux-var for deviation-norm

# Constraints
constraints = []

# (1) each person is exactly in one group
for p in range(N_POPULATION):
    constraints.append(sum_entries(X[p, :]) == 1)

# (2) each group has exactly n (a-priori known) members
for g_ind, g_size in enumerate(GROUPSIZES):
    constraints.append(sum_entries(X[:, g_ind]) == g_size)

# Objective: minimize deviation from global-statistics within each group
#   (ugly code; could be improved a lot!)
group_deviations = [[], [], [], []]  # age, height, weight, skill
for g_ind, g_size in enumerate(GROUPSIZES):
    group_deviations[0].append((sum_entries(mul_elemwise(age_vector, X[:, g_ind])) / g_size) - age_mean)
    group_deviations[1].append((sum_entries(mul_elemwise(height_vector, X[:, g_ind])) / g_size) - height_mean)
    group_deviations[2].append((sum_entries(mul_elemwise(weight_vector, X[:, g_ind])) / g_size) - weight_mean)
    group_deviations[3].append((sum_entries(mul_elemwise(skill_vector, X[:, g_ind])) / g_size) - skill_mean)

for i in range(4):
    for g in range(N_GROUPS):
        constraints.append(D[i,g] >= abs(group_deviations[i][g]))

obj_parts = [sum_entries(OBJ_FACTORS[i] * D[i, :]) for i in range(4)]

objective = Minimize(sum(obj_parts))

""" Build optimization-problem & solve """
problem = Problem(objective, constraints)
problem.solve(solver=GUROBI, verbose=True, TimeLimit=120)  # might need to use non-commercial solver here
print('Min-objective: ', problem.value)

""" Evaluate solution """
filled_groups = [[] for g in range(N_GROUPS)]
for g_ind, g_size in enumerate(GROUPSIZES):
    for p in range(N_POPULATION):
        if np.isclose(X[p, g_ind].value, 1.0):
            filled_groups[g_ind].append(p)
for g_ind, g_size in enumerate(GROUPSIZES):
    print('Group: ', g_ind, ' of size: ', g_size)
    print(' ' + str(filled_groups[g_ind]))

group_stats = []
for g in range(N_GROUPS):
    age_mean_in_group = age_vector[filled_groups[g]].mean()
    height_mean_in_group = height_vector[filled_groups[g]].mean()
    weight_mean_in_group = weight_vector[filled_groups[g]].mean()
    skill_mean_in_group = skill_vector[filled_groups[g]].mean()
    group_stats.append((age_mean_in_group, height_mean_in_group, weight_mean_in_group, skill_mean_in_group))
print('group-assignment solution means: ')
for g in range(N_GROUPS):
    print(np.round(group_stats[g], 1))

""" Compare with input """
input_data = np.vstack((age_vector, height_vector, weight_vector, skill_vector))
print('input-means')
print(age_mean, height_mean, weight_mean, skill_mean)
print('input-data')
print(input_data)

Output (time-limit of 2 minutes; commercial solver)

Time limit reached
Best objective 9.612058823514e-01, best bound 4.784117647059e-01, gap 50.2280%
('Min-objective: ', 0.961205882351435)
('Group: ', 0, ' of size: ', 3)
 [16, 20, 27]
('Group: ', 1, ' of size: ', 6)
 [26, 32, 34, 45, 47, 49]
('Group: ', 2, ' of size: ', 12)
 [0, 6, 10, 12, 15, 21, 24, 30, 38, 42, 43, 48]
('Group: ', 3, ' of size: ', 12)
 [2, 3, 13, 17, 19, 22, 23, 25, 31, 36, 37, 40]
('Group: ', 4, ' of size: ', 17)
 [1, 4, 5, 7, 8, 9, 11, 14, 18, 28, 29, 33, 35, 39, 41, 44, 46]
group-assignment solution means: 
[  33.3  179.3   83.7   49. ]
[  33.8  178.2   84.3   49.2]
[  33.9  178.7   83.8   49.1]
[  33.8  179.1   84.1   49.2]
[  34.   179.6   84.7   49. ]
input-means
(33.859999999999999, 179.06, 84.239999999999995, 49.100000000000001)
input-data
[[  22.   35.   28.   32.   41.   26.   25.   37.   32.   26.   36.   36.
    27.   34.   38.   38.   38.   47.   35.   35.   34.   30.   38.   34.
    31.   21.   25.   28.   22.   40.   30.   18.   32.   46.   38.   38.
    49.   20.   53.   32.   49.   44.   44.   42.   29.   39.   21.   36.
    29.   33.]
 [ 161.  158.  177.  195.  197.  206.  169.  182.  182.  198.  165.  185.
   171.  175.  176.  176.  172.  196.  186.  172.  184.  198.  172.  162.
   171.  175.  178.  182.  163.  176.  192.  182.  187.  161.  158.  191.
   182.  164.  178.  174.  197.  156.  176.  196.  170.  197.  192.  171.
   191.  178.]
 [  85.  103.   99.   93.   71.  109.   63.   87.   60.   94.   48.  122.
    56.   84.   69.  162.  104.   71.   92.   97.  101.   66.   58.   69.
    88.   69.   80.   46.   74.   61.   25.   74.   59.   69.  112.   82.
   104.   62.   98.   84.  129.   71.   98.  107.  111.  117.   81.   74.
   110.   64.]
 [  81.   67.   49.   74.   65.   93.   25.    7.   99.   34.   37.    1.
    25.    1.   96.   36.   39.   41.   33.   28.   17.   95.   11.   80.
    27.   78.   97.   91.   77.   88.   29.   54.   16.   67.   26.   13.
    31.   57.   84.    3.   87.    7.   99.   35.   12.   44.   71.   43.
    16.   69.]]

Solution remarks

  • This solution looks quite nice (regarding mean deviation) and it only took 2 minutes (we decided on the time limit a priori)
  • We also got bounds: 0.961 is our solution's objective, and we know the optimum can't be lower than 0.478 (hence the ~50% gap reported in the log)

Reproducibility

  • The code uses numpy and cvxpy
  • A commercial solver (Gurobi) was used
    • You might need to use a non-commercial MIP-solver (supporting time-limit for early abortion; take current best-solution)
    • The valid open-source MIP-solvers supported in cvxpy are: cbc (no chance of setting time-limits for now) and glpk (check the docs for time-limit support)

Model decisions

  • The code uses L1-norm penalization, which results in an MIP problem
  • Depending on your problem, it might be wise to use L2-norm penalization (one big deviation hurts more than many smaller ones), which will result in a harder problem (MIQP / MISOCP)
– sascha