iterate through all L length sequences of N symbols, that include all N symbols

Question

What is an efficient way to do the following in python?
Given N symbols, iterate through all L length sequences of N symbols, that include all N symbols.

The order does not matter, as long as all sequences are covered, and each only once.

Let's call this iterator seq(symbols,L). Then, for example,
list(seq([1,2,3],2))=[]
list(seq([1,2,3],3))=[(1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1)]
list(seq([1,2,3],4))=[(1, 1, 2, 3), (1, 1, 3, 2), (1, 2, 1, 3), ...

Here's an intuitive, yet slow implementation:

import itertools

def seq(symbols,L):
  for x in itertools.product(symbols,repeat=L):
    if all(s in x for s in symbols):
      yield x

When N is large and L is close to N, there is a lot of wasted effort. For example, when L==N, it would be much better to use itertools.permutations(). Since every sequence needs to have all N symbols, it seems like a better solution would somehow start with the permuted solution, then add in the extra repeated symbols somehow, but I can't figure out how to do this without double counting (and without resorting to saving all previous output to check for a repeat).

Is this what you want? http://en.wikipedia.org/wiki/Permutation#Algorithms_to_generate_permutations — Tarzan, Jan 28 '14 at 04:26

Niklas B. · Accepted Answer · 2014-01-28T07:03:35.313

2

An idea:

import itertools
def solve(size, symbols, todo = None):
  if todo is None: todo = frozenset(symbols)
  if size < len(todo): return
  if size == len(todo):
    yield from itertools.permutations(todo)  # use sorted(todo) here 
                                             # for lexicographical order
    return
  for s in symbols:
    for xs in solve(size - 1, symbols, todo - frozenset((s,))):
      yield (s,) + xs

for x in solve(5, (1,2,3)):
  print(x)

Will print all sequences of size 5 that contain each of 1,2,3 and 2 more arbitrary elements. You can use bitmasks instead of a set if you aim for efficiency, but I guess you're not since you are using Python :) The complexity is optimal in the sense that it is linear in the output size.

Some "proof":

 $ python3 test.py | wc -l                               # number of output lines
 150
 $ python3 test.py | sort | uniq | wc -l                 # unique output lines
 150
 $ python3 test.py | grep "1"|grep "2"|grep "3"| wc -l   # lines with 1,2,3
 150

edited Jan 28 '14 at 07:03

answered Jan 28 '14 at 04:24

Niklas B.

92,950
18
194
224

Thanks for the hint on the bitmasks, as I'll probably be writing this up in C at some point (I like playing with algorithms in python, as it is easier to think about there). I'm not convinced about the complexity comment though, as this is still "checking" a lot of sequences only to toss them out. For example it will try starting with (1,1,1,1) then realizes that won't work, then it rejects (1,1,1,2,1) and then (1,1,1,2,2), etc. So it does prune better than my initial slow solution, but it still feels like there should be a solution taking advantage of itertools.permutations(symbols). – EdBrown Jan 28 '14 at 04:53
@EdBrown: Granted, it does a single recursion that is invalid, only to prune immediately in the `if size < len(todo): return` line. That's still `O(1)` per transition though ;) – Niklas B. Jan 28 '14 at 04:56
Not quite O(1), since building that set might be more expensive. – user2357112 Jan 28 '14 at 04:57
@user2357112: Yeah sure, but you can use bitmasks instead for `O(1)` set operations at least for reasonable symbol set sizes (I'm assuming here that the `symbols` set will be small enough for the algorithm to finish in some reasonable amount of time). The tuple prefix is never actually built, though – Niklas B. Jan 28 '14 at 04:58
Maybe I'm not understanding. Let's say we're at size=1 and todo=(3,). This will try all N symbols to discover that only 3 works. That doesn't sound like an O(1) to me. That sounds like O(N). I would therefore put the complexity at O(N^N), similar to my example solution. Sorry if I'm slow to understand, what am I missing? – EdBrown Jan 28 '14 at 05:05
@EdBrown: I think you might actually be right. I thought because that situation only happens if `size == len(todo)`, we can charge the useless operations to the work we must have done before to get to that situation. That's not true however, as evidenced by the input family `size = n, symbols = (1,...,n)`, which requires `O(n^n)` work as you noted. It's easy to fix though, lemme edit. – Niklas B. Jan 28 '14 at 05:09
@EdBrown: Using `itertools.permutations` in the case where `size == len(todo)` should do the trick. Since there is no longer a `return` in that function like before, I must really be lacking some serious sleep if I have again missed something here. – Niklas B. Jan 28 '14 at 05:13
@NiklasB. Nice, that makes sense. I think leaving in the "if size < len(todo): return" is a good idea though. Thanks much for the help! What I actually need is a small generalization of this, which your setup already naturally handles by allowing me to set the 'todo' when calling. Very nice. – EdBrown Jan 28 '14 at 05:23
@NiklasB, very close. Now what happens for `list(solve(2, (1,2,3)))`? Infinite recursion ;-) Time to do the heavy lifting in a nested function, and do the trivial special-casing outside that. – Tim Peters Jan 28 '14 at 05:24
@TimPeters: That's not even a valid input. I tend to introduce undefined behaviour if people don't follow my contracts :) – Niklas B. Jan 28 '14 at 05:24
@NiklasB, of course it's valid - the OP even showed a case with L < N in the original question ("list(seq([1,2,3],2))=[]"). – Tim Peters Jan 28 '14 at 05:26
@TimPeters: Right, that actually makes sense :) Leaving the `if size < len(todo): return` in as OP suggested will take care of this. – Niklas B. Jan 28 '14 at 05:27
Yup! BTW, I upvoted your answer a long time ago - I'm just quibbling now ;-) – Tim Peters Jan 28 '14 at 05:27
@Tim: Damn, thought I might be fishing for some upvotes here. [algorithm] really isn't the most rewarding of tags... – Niklas B. Jan 28 '14 at 05:29
BTW, note that the guarantee of lexicographic order has been lost, since there's no defined order in which the elements of a set are traversed (`permutations(todo)`). The OP explicitly said they didn't care about order, so it's just that one of the promises you made in your answer is no longer true. – Tim Peters Jan 28 '14 at 05:38
@TimPeters: Yep, good catch. Probably not worth using a sort just to make it happen either. – Niklas B. Jan 28 '14 at 05:39

score 1 · Answer 2 · answered Jan 28 '14 at 05:29

1

You can do this by breaking the problem into two parts:

Find every possible multiset of size L of N symbols which includes every symbol at least once.
For each multiset, find all unique permutations.

For simplicity, let's suppose the N symbols are the integers in range(N). Then we can represent a multiset as a vector of length N whose values are non-negative integers summing to L. To restrict the multiset to include every symbol at least once, we require that the values in the vector all be strictly positive.

def msets(L, N):
  if L == N:
    yield (1,) * L
  elif N == 1:
    yield (L,)
  elif N > 0:
    for i in range(L - N + 1):
      for m in msets(L - i - 1, N - 1):
        yield (i + 1,) + m

Unfortunately, itertools.permutations does not produce unique iterations of lists with repeating elements. If we were writing this in C++, we could use std::next_permutation, which does produce unique iterations. There is a sample implementation (in C++, but it's straightforward to convert it to Python) on the linked page.

answered Jan 28 '14 at 05:29

rici

234,347
28
237
341

I might be missing something, but why is `len(list(msets(5,3))) = 6`? – Niklas B. Jan 28 '14 at 05:32
@Niklas: What would you expect it to be? There are six possible multisets of length 5 of the elements (0,1,2) which contain each element at least once. – rici Jan 28 '14 at 05:34
Ah, I thought you're producing the permutations as well. Nevermind :) – Niklas B. Jan 28 '14 at 05:35
@NiklasB.:it's a bit late here; I think the C++ code is sufficiently clear for an algorithm question, but I could write the python equivalent in the morning. – rici Jan 28 '14 at 05:36
BTW, your function is essentially `(tuple(xrange(N)) + y for y in itertools.combinations_with_replacement(xrange(N), L - N))` – Niklas B. Jan 28 '14 at 05:47
This approach is extra intuitive, I like it. Instead of a msets giving a "counting vector", why not have msets do something like, for x in itertools.combinations_with_replacement(range(N),L-N): yield range(N)+x. Then just "next_permutation" over everything mset returns. Does that sound valid? – EdBrown Jan 28 '14 at 05:50
Oh, Niklas beat me to it. And here's some python next_permutation code, along with speed tests, if anyone is interested: http://stackoverflow.com/questions/6534430/why-does-pythons-itertools-permutations-contain-duplicates-when-the-original – EdBrown Jan 28 '14 at 05:51
@NiklasB.: that's pretty. But the logic of next_permutation requires the initial sequence to be sorted (trivial change, of course). – rici Jan 28 '14 at 16:45
@NiklasB. what happened to your answer? It disappeared! – EdBrown Jan 29 '14 at 00:46
@EdBrown: Sorry I had deleted it... it's undeleted now – Niklas B. Jan 29 '14 at 00:46

iterate through all L length sequences of N symbols, that include all N symbols

2 Answers2