Find unique compositions of k-distinct-element subsets for a set of n elements

Question

This is an algorithmic problem. If I miss any existing function in Python that helps, please give a shout.

Given a set s of n elements, we can use itertools.combinations() function in Python to find all the unique k-element subsets. Let's call the set containing all these subsets S. Notice that each such subset has k distinct elements.

The problem is 2-step. First, given these k-distinct-element subsets, I want to compose (some of) them such that (a composition is simply a superset of some subsets):

The intersection between any two subsets within a composition is empty
The union between all the subsets in a composition gives exactly the original set s

Second, I want to find the compositions that:

Do not share any subset
Their union gives exactly the S, i.e. the set of all the k-element subsets

As a concrete example here, consider the original set s = {a, b, c, d} and k = 2, we then will have the following three compositions/supersets:

{{a, b}, {c, d}}, {{a, c}, {b, d}}, {{a, d}, {b, c}}

Obviously, the size of s can be large and k >= 2, so an efficient (especially the speed) algorithm is needed here.

P.S. I phrase the problem in 2 steps, but it may well be that an efficient algorithm attacks the problem from a different angle.

So, [exact cover](http://en.wikipedia.org/wiki/Exact_cover) problem, constraining the size of the subsets and finding ALL exact covers. I'm no mathematician but I imagine it's NP-hard. — roippi, Sep 18 '14 at 01:22
FWIW, Ali Assaf has an implementation of Knuth's "Algorithm X" for finding solutions to Exact Cover problems: [Algorithm X in 30 lines!](http://www.cs.mcgill.ca/~aassaf9/python/algorithm_x.html); there's a link on that page to a program that uses that code in a Sudoku solver. It uses set() instead of the linked lists used in Knuth's "Dancing Links" algorithm. Last year I used a Python 2.6 version of Assaf's code to find solutions for 4 colour mapping, and also to find latin squares, but my home-grown latin square maker is much faster, especially when the square size is large. — PM 2Ring, Sep 18 '14 at 06:50
Thanks for pointing me to the exact cover problem. However I think my specific problem is more complicated than the standard case: in some sense, I need to solve 2 exact cover problems in one go. First, a cover among elements for each qualified compositions, and then a cover among qualified compositions. — MLister, Sep 18 '14 at 17:00
If I understand your question correctly, the problem is incredibly symmetric and as long as `n` is a multiple of `k`, there's always a solution. Do you want a single solution or all solutions? (note that even printing a single solution will be θ(2^n) in case 2k~n, as there is exponentially many k-subsets of that size). — liori, Sep 18 '14 at 22:27
@liori, yes, I recognize that there always exists a solution as long as `n` is a multiple of `k`, and that there may exist multiple solutions that satisfy the criterion above. I'm interested in finding **any one** solution. We need to optimize it as much as possible, especially if we want `n>=20` with `k=2`. — MLister, Sep 18 '14 at 23:58
You should post this question in http://codegolf.stackexchange.com/ and challenge them not only to solve it, but to solve it in the minimal number of code characters ;) Someome will post a CJam solution within minutes :D — GreenAsJade, Sep 20 '14 at 05:54

David Eisenstat · Accepted Answer · 2020-11-03T16:43:40.263

I implemented the integral maximum flow construction that's used to prove Baranyai's theorem. More details in your favorite textbook covering factors of a complete hypergraph.

from collections import defaultdict
from fractions import Fraction
from math import factorial
from operator import itemgetter


def binomial(n, k):
    return factorial(n) // (factorial(k) * factorial(n - k))


def find_path(graph, s, t):
    stack = [s]
    predecessor = {s: t}
    while stack:
        v = stack.pop()
        for u in graph[v]:
            if u not in predecessor:
                stack.append(u)
                predecessor[u] = v
    assert t in predecessor
    path = [t]
    while path[-1] != s:
        path.append(predecessor[path[-1]])
    path.reverse()
    return path


def round_flow(flow):
    while True:
        capacities = []
        for (u, v), x in flow.items():
            z = x - x.numerator // x.denominator
            if z:
                capacities.append(((v, u), z))
                capacities.append(((u, v), 1 - z))
        if not capacities:
            break
        (t, s), delta = min(capacities, key=itemgetter(1))
        graph = defaultdict(list)
        for (v, u), z in capacities:
            if (v, u) not in [(s, t), (t, s)]:
                graph[v].append(u)
        path = find_path(graph, s, t)
        for i, v in enumerate(path):
            u = path[i - 1]
            if (u, v) in flow:
                flow[(u, v)] += delta
            else:
                flow[(v, u)] -= delta


def baranyai(n, k):
    m, r = divmod(n, k)
    assert not r
    M = binomial(n - 1, k - 1)
    partition = [[()] * m for i in range(M)]
    for l in range(n):
        flow = defaultdict(Fraction)
        for i, A_i in enumerate(partition):
            for S in A_i:
                flow[(i, S)] += Fraction(k - len(S), n - l)
        round_flow(flow)
        next_partition = []
        for i, A_i in enumerate(partition):
            next_A_i = []
            for S in A_i:
                if flow[(i, S)]:
                    next_A_i.append(S + (l,))
                    flow[(i, S)] -= 1
                else:
                    next_A_i.append(S)
            next_partition.append(next_A_i)
        partition = next_partition
    assert len(partition) == M
    classes = set()
    for A in partition:
        assert len(A) == m
        assert all(len(S) == k for S in A)
        assert len({x for S in A for x in S}) == n
        classes.update(map(frozenset, A))
    assert len(classes) == binomial(n, k)
    return partition


if __name__ == '__main__':
    print(baranyai(9, 3))
    print(baranyai(20, 2))

Let me dump an email that I wrote about this answer here, since it may be useful to others.

Unfortunately there's nothing that would be better suited as a source for transliteration.

The construction I used is due to Brouwer and Schrijver (1979). I could see only half of it at the time because I was looking on Google Books, but there's a PDF floating around now. It's a high-level mathematical description of an inductive proof that sets up a max flow problem, exhibits a fractional solution, and asserts the existence of an integer solution without going into detail about how to get it. My Python implementation used pipage rounding to follow the exact structure of the proof, but if you're just trying to get work done I would suggest calling out to an R library that computes max flows (e.g., igraph).

Let me dash off a concrete example of how to set up the max flows because the abstract proof was pretty mysterious to me before I figured it out. The smallest non trivial example is n=6 and k=3. This means we have (6-1) choose (3-1) = 10 partitions, each with 2 sets of size 3. Starting off with a 10 by 2 by 3 array having all elements blank, the B&S proof calls for us to place the 1s (one per row), then the 2s, then ..., then the 6s. Placing the 1s is boring due to symmetry considerations, so I'll just do it without explanation. (You don't need to special case this in code.)

{1,_,_}, {_,_,_}
{1,_,_}, {_,_,_}
{1,_,_}, {_,_,_}
{1,_,_}, {_,_,_}
{1,_,_}, {_,_,_}
{1,_,_}, {_,_,_}
{1,_,_}, {_,_,_}
{1,_,_}, {_,_,_}
{1,_,_}, {_,_,_}
{1,_,_}, {_,_,_}

Starting with the 2s, things get interesting. The key invariant in intuitive English is that we want each censored subset to appear exactly as many times as we would expect. Of the 6 choose 3 = 20 size-3 subsets of {1, 2, ..., 6} with numbers >2 censored with , there are 4 choose 1 = 4 subsets that look like {1,2,} and 4 choose 2 = 6 that look like {1,,} and 4 choose 2 = 6 that look like {2,,} and 4 choose 3 = 4 that look like {,,_}. What we want to do is place one 2 in each row to get this distribution. Easy enough to do this one by hand:

{1,2,_}, {_,_,_}
{1,2,_}, {_,_,_}
{1,2,_}, {_,_,_}
{1,2,_}, {_,_,_}
{1,_,_}, {2,_,_}
{1,_,_}, {2,_,_}
{1,_,_}, {2,_,_}
{1,_,_}, {2,_,_}
{1,_,_}, {2,_,_}
{1,_,_}, {2,_,_}

When we get to 3 we have 3 choose 0 = 1x {1,2,3}, 3 choose 1 = 3x {1,2,}, 3x {1,3,}, 3x {2,3,}, 3 choose 2 = 3x {1,,}, 3x {2,,}, 3x {3,,}, 3 choose 3 = 1x {,,}. Actual choices to make! This is where max flow comes in.

Let ell be the number that we're placing. The flow network we construct has a source, a vertex for each row, a vertex for each censored subset containing ell, and a sink. There is an arc from the source to each row vertex of capacity 1. There is an arc from each censored subset S to the sink of capacity (n-1 - ell) choose (k - |S|). There is an arc of capacity 1 from each row vertex to each censored subset that could appear in that row if we placed ell there.

Lettering the rows a..j, the middle arcs look like

a-->{1,2,3}
a-->{3,_,_}
b-->{1,2,3}
b-->{3,_,_}
c-->{1,2,3}
c-->{3,_,_}
d-->{1,2,3}
d-->{3,_,_}
e-->{1,3,_}
e-->{2,3,_}
...
j-->{1,3,_}
j-->{2,3,_}

Get an integral max flow, which will place exactly one ell in each row. The placement looks something like

{1,2,3}, {_,_,_}
{1,2,_}, {3,_,_}
{1,2,_}, {3,_,_}
{1,2,_}, {3,_,_}
{1,3,_}, {2,_,_}
{1,3,_}, {2,_,_}
{1,3,_}, {2,_,_}
{1,_,_}, {2,3,_}
{1,_,_}, {2,3,_}
{1,_,_}, {2,3,_}

And on it goes until we get

{1,2,3}, {4,5,6}
{1,2,4}, {3,5,6}
{1,2,5}, {3,4,6}
{1,2,6}, {3,4,5}
{1,3,4}, {2,5,6}
{1,3,5}, {2,4,6}
{1,3,6}, {2,4,5}
{1,4,5}, {2,3,6}
{1,4,6}, {2,3,5}
{1,5,6}, {2,3,4}

Hope this helps!

thank you for the answer. It solves only part of the problem, as the recursive function gives all the covers/compositions for the original set of elements. I think we need to solve the exact cover problem again for these compositions to satisfy the criterion in the 2nd step of my problem. — MLister, Sep 18 '14 at 17:05
@MLister I updated this answer to use Baranyai's theorem to construct the object that you really want in one step. — David Eisenstat, Sep 20 '14 at 05:11
this is very cool and it significantly speed up the problem-solving comparing to solving the exact cover problem using X algorithm twice. I looked up Baranyai's theorem, and it does fit my problem better. I'm still looking for a good tutorial on using the max flow construction to solve the problem. Thanks a lot. — MLister, Sep 21 '14 at 19:52
@MLister I'm not sure there is one. I was working from a Google Books result that hid the second page of the proof. At a high level, if we have a solution to your problem and we delete all of the numbers `l..n-1`, then we know how often each subset of `0..l-1` should occur. We need to insert `l` exactly once into each of the outer sets so as to respect the cardinality constraint for subsets of `0..l`. We can count how many times each subset of `0..l-1` needs to be extended, and the whole thing turns into a flow problem, for which there's an easy to write fractional solution. — David Eisenstat, Sep 21 '14 at 20:08
@DavidEisenstat Thanks so much for this implementation! I had been looking everywhere for an implementation of integral maximum flow to help solidify my understanding of the proof, and this did the job perfectly. =) — PancakesOwn, Aug 22 '18 at 09:07

Find unique compositions of k-distinct-element subsets for a set of n elements

1 Answers1

Linked

Related