Efficiently generating combinations without duplicate subsets

Question

Given I have N (1 ≤ N) numbers, 1 .. N, efficiently generate the maximum number of possible L (1 ≤ L ≤ N) sized combinations of it with the restriction that any U length subset of a combination (1 ≤ U ≤ L) only appears once in the result. There are usually many results satisfying the constraints - any result will do.

For example, if N is 5, L is 3 and the final constraint is dropped (ie, the answer is just the combinations), then we have:

123,124,125,134,135,145,234,235,245,345;

Once a U is 2 constraint is introduced any of these lines is an acceptable solution, provided it is generated efficiently:

123,145;
123,245;
123,345;
124,135;
124,235;
124,345;
125,134;
125,234;
125,345;

The ideal run time is O(size_of_output). In my use case N is always much bigger (an order of magnitude or more) than L, so anything faster than calculating all the combinations will be an improvement on what I came up with up (which is too slow):

import itertools

def unique_combinations(population, length, unique):
    seen = set()
    for r in itertools.combinations(range(population), length):
        u = set(itertools.combinations(r, unique))
        if not (u & seen):
            yield r
            seen |= u

Bonus points for providing a way deterministically pick any given solution from the list of valid ones. For example, there were 9 valid solution to the (N=5,L=3,U=2) example above. Being able to an additional parameter 1 .. 9 selecting which was returned would be really nice.

A simple formula for calculating the number of combinations in the result would also be helpful.

You have shown 2 pairs for 123 while there are thee ones: `123 145; 123 245; 123 345` so seems 18 variants do exist for your example. — MBo, Nov 26 '18 at 07:23
MBo - All three of your combinations share the pair `45`. It can be in at most one of them. — Russell Stuart, Nov 26 '18 at 10:09
Don't understand yet. Four of your combinations share pair 45 — MBo, Nov 26 '18 at 10:18
I’m a little confused by your output. You say you are drawing from a pool of combinations, but why are your outputs shuffled? — Joseph Wood, Nov 26 '18 at 10:32
OP, I think you're misreading @MBo's comment. You can't add line breaks in comments, so here a semicolon means what a period and linebreak meant in your question: these are three separate combinations of size L=2, in none of which is there a shared subset of size U=2. — jwimberley, Nov 26 '18 at 15:08
MBo, jimberley: I did indeed misread MBo's comment. I had calculated the solutions by hand and he is correct, I missed some. I adapted my python to find them using brute force, and updated the answer accordingly, adopting MBo's more concise notation in the process. There are 9 solutions. — Russell Stuart, Nov 26 '18 at 21:17
Can you explain more about the `U` constraint. I don't get it. I don't read minds. I don't use Python very often and have not used itertools ever. If the solution with the language I think will do it, then it will look too simple, but then I can skip the explanation and you have to read my mind. — Guy Coder, Nov 26 '18 at 23:33
@GuyCoder, the background is I want to generate strings that look mostly different. Given I have n smaller strings `s1..sn`, I can generate ⁿCₗ strings by combining `l` of them at a time. However for any given output there will be `n-1` outputs containing `l-1`identical `sN`s. `U` allows me to say `l-1` is too high, I can only tolerate a lower figure. — Russell Stuart, Nov 26 '18 at 23:37
That still is not clear. Examples with detailed explanations would help. — Guy Coder, Nov 26 '18 at 23:54

Guy Coder · Answer 1 · 2018-11-27T12:58:35.723

This is a comment put in an answer because it is to large for a comment.

This is a partial answer to get the OP to write better examples in his question. I don't understand the U parameter and as I noted in the comments I might have a really simple solution to his problem but it probably won't be in something he understands. Since I can't understand all of the question here is an answer he probably won't be able to understand. And yes I can write great answers and great questions.

So if the OP can explain the U parameter so that I can understand it, then I can see if my idea works, and if so post the answer and explain it here. But if the OP doesn't help me help him, then the answer may be sweep away in the sands of time.

For the first part

comb(0,_,[]).
comb(N,[X|T],[X|Comb]) :-
    N>0,
    N1 is N-1,
    comb(N1,T,Comb).
comb(N,[_|T],Comb) :-
    N>0,
    comb(N,T,Comb).

when run returns

?- comb(3,[1,2,3,4,5],C).
C = [1, 2, 3] ;
C = [1, 2, 4] ;
C = [1, 2, 5] ;
C = [1, 3, 4] ;
C = [1, 3, 5] ;
C = [1, 4, 5] ;
C = [2, 3, 4] ;
C = [2, 3, 5] ;
C = [2, 4, 5] ;
C = [3, 4, 5] ;
false.

Efficiently generating combinations without duplicate subsets

1 Answers1