1

I would like to get all possible combinations of the letters in a string with length k. I know there are many posts on this but I have a little twist k is larger than the length of the string.

This is what I have so far, its simple and works if k <= len(string):

 string = 'ABCD'
 permutations = ["".join(x) for x in itertools.permutations(string, k)]

results if k = 4:

 ['ABCD', 'ABDC', 'ACBD', 'ACDB', 'ADBC', 'ADCB', 'BACD', 'BADC', 'BCAD', 'BCDA', 
'BDAC','BDCA', 'CABD', 'CADB', 'CBAD', 'CBDA', 'CDAB', 'CDBA', 'DABC', 'DACB', 
'DBAC', 'DBCA', 'DCAB', 'DCBA']

This works as expected. However, I would like all possible combinations of these four letters with k > len(string).

An example answer I would like would be:

string = 'AB'
k = 4
result = ['AAA,'ABB','AAB', 'ABA','BBB', 'BAA'.......]

Thanks in advance.

Samantha
  • 321
  • 2
  • 11
  • 4
    There are 4^15 strings of length 15 consisting of A, B, C and D. That's over **one billion** strings. Of course your computer jammed up. (Note that there are **one trillion** permutations of 15 elements, which is what you asked your computer to calculate) – nneonneo Oct 17 '13 at 21:04
  • You should give an example of what output you want to see. – nneonneo Oct 17 '13 at 21:06
  • 1
    @nneonneo: right, that way SO can be jammed up trying to process a multi-GB HTTP POST, instead of just the questioner's computer being jammed up ;-) – Steve Jessop Oct 17 '13 at 21:11
  • @SteveJessop: I'm relatively sure that such a POST would fail. At least, I hope so. (sudden urge to try) – nneonneo Oct 17 '13 at 21:12
  • Note that `itertools.permutations` is giving you the number of well.... permutations, not combinations. Following what @nneonneo said, if you're allowing repeated letters there are m^n possible strings of length n, using m symbols. Given a string s of length n, there are n! (factorial) permutations of s. A permutation is a rearrangement that uses each of the original items once and only once. – axblount Oct 17 '13 at 21:15
  • Ok yes this is why it jammed up makes sense. Sorry I'm very new to programming. Another possibility would be to not store all combinations. I'm trying to search a very large string for the number of occurrences of each combination and see which combination occurs the most often. – Samantha Oct 17 '13 at 21:23
  • 1
    @Samantha: Then ask that. Don't fall into the trap of the X/Y problem. (If you actually check every single combination, your memory and CPU time will explode. There are much, much easier ways). – nneonneo Oct 17 '13 at 22:31
  • The example you gave with `'AB'` and `k = 4` does not follow the same logic as the one you gave with `'ABCD'` and `k = 4`. I find this misleading. I stuck with the interpretation of the latter to build my answer. – Rerito Oct 17 '13 at 22:38
  • 1
    Your examples don't seem to correspond. You have `'AAA'` in the example at the end, but you don't have `'AAAA'` in the output from the `permutations` code that you say is correct. – Steve Jessop Oct 17 '13 at 22:39

3 Answers3

9

You may well want

itertools.product(string, repeat=k)

instead. Try it! Your description is ill-defined, so can't guess for sure.

Example:

>>> import itertools
>>> for p in itertools.product("ab", repeat=3):
...     print p
('a', 'a', 'a')
('a', 'a', 'b')
('a', 'b', 'a')
('a', 'b', 'b')
('b', 'a', 'a')
('b', 'a', 'b')
('b', 'b', 'a')
('b', 'b', 'b')
Tim Peters
  • 67,464
  • 13
  • 126
  • 132
  • 2
    And also be aware that you're unlikely to have enough memory to be able to make a list out of this for k=15. But that's OK, the point of `itertools` in part is to generate sequences too big for memory. – Steve Jessop Oct 17 '13 at 21:17
3

Based on your comment:

I'm trying to search a very large string for the number of occurrences of each combination and see which combination occurs the most often.

There's another way to do what you want:

def substrings(vlarge, k):
    return (vlarge[idx:idx+k] for idx in range(len(vlarge)-k+1))

def uses_only(value, chars):
    return all(ch in chars for ch in value)

def most_common(vlarge, chars, k):
    return collections.Counter(s for s in substrings(vlarge, k) if uses_only(s, chars)).most_common(1)

You can then look at making this basic idea more efficient: for example if you encounter an 'x' character in vlarge then you know that none of the substrings that include it are combinations of 'abcd'. So you can skip ahead to the substring that starts one place after the x:

def generate_substrings(vlarge, chars, k):
    idx = 0
    goodrange = 0
    while idx <= len(vlarge) - k:
        while goodrange < idx + k:
            if vlarge[goodrange] in chars:
                goodrange += 1
            else:
                idx = goodrange + 1
                if idx > len(vlarge) - k:
                    return
                goodrange = idx
        yield vlarge[idx:goodrange]
        idx += 1

def most_common(vlarge, chars, k):
    return collections.Counter(generate_substrings(vlarge, chars, k)).most_common(1)

Compared with this approach, the "obvious" idea (iterate over all combinations, counting how many times they appear as a substring, and keeping track of the best so far) uses less memory but will be a lot slower, since it has to make a lot of passes over the very large string.

If I have misunderstood what you mean by "combinations", that is to say if my uses_only function is incorrect, then you'd have to adjust my idea accordingly. The point is: count the actual substrings of the form you want, because there are fewer of them than there are hypothetical substrings of the correct form.

Steve Jessop
  • 273,490
  • 39
  • 460
  • 699
1

My answer will only be a theoretical analysis of what you are doing. I will denote the binomial coefficient defining the number of parts containing k elements of a set containing n elements by C(k,n).

Suppose you have a string of length n ∈ ℕ* and k ∈ ℕ, k ⩾ n. I will assume all the characters in your string are distinct.
I understood that you are trying to build a string of k characters extracted from your input string.

A combination of your string characters can be seen as a permutation of ⟦1, n⟧. There are n! such permutations ...

Then, when k > n, things are getting much worse ... Let r = k mod n and p = (k - r)/n. Obviously we have :

p ⩾ 1
0 ⩽ r < p

Your output string can there be decomposed in p "complete" substrings made out of a permutation of your n input characters and one substring made out of only r characters of your input string.

To build such an "incomplete" substring, you first have to choose a subset of r characters of your input string and then a permutation of such characters. Finally, the number sr,n of such possible substrings is :

sr,n = C(r,n).r!

Note that this formula won't lead to an invalid global result when r = 0, since C(0,n) = 1 and 0! = 1 by convention.

The final number of k-long strings you can build according to your scheme is :

stot = (C(r,n).r!).(n!)p

This number is outrageously high !

With k = 4 and n = 2, we have :

stot = (C(0,4).0!).(2!)2 = 4

result = ['ABAB', 'ABBA', 'BAAB', 'BABA']
Rerito
  • 5,886
  • 21
  • 47