8

I'm trying to solve the general problem of getting the unique combinations from a list in Python

Mathematically from https://www.mathsisfun.com/combinatorics/combinations-permutations-calculator.html I can see that the formula for the number of combinations is n!/r!(n-r)! where n is the length of the sequence and r is the number to choose.

As shown by the following python where n is 4 and r is 2:

lst = 'ABCD'
result = list(itertools.combinations(lst, len(lst)/2))
print len(result)
6

The following is a helper function to show the issue I have:

def C(lst):
    l = list(itertools.combinations(sorted(lst), len(lst)/2))
    s = set(l)
    print 'actual', len(l), l
    print 'unique', len(s), list(s)

If I run this from iPython I can call it thus:

In [41]: C('ABCD')
actual 6 [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
unique 6 [('B', 'C'), ('C', 'D'), ('A', 'D'), ('A', 'B'), ('A', 'C'), ('B', 'D')]

In [42]: C('ABAB')
actual 6 [('A', 'A'), ('A', 'B'), ('A', 'B'), ('A', 'B'), ('A', 'B'), ('B', 'B')]
unique 3 [('A', 'B'), ('A', 'A'), ('B', 'B')]

In [43]: C('ABBB')
actual 6 [('A', 'B'), ('A', 'B'), ('A', 'B'), ('B', 'B'), ('B', 'B'), ('B', 'B')]
unique 2 [('A', 'B'), ('B', 'B')]

In [44]: C('AAAA')
actual 6 [('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'A')]
unique 1 [('A', 'A')]

What I want to get is the unique count as shown above but doing a combinations and then set doesn't scale. As when the length of lst which is n gets longer it slows down as the combinations get greater and greater.

Is there a way of using math or Python tricks to to solve the issue of counting the unique combinations ?

sotapme
  • 4,695
  • 2
  • 19
  • 20
  • 3
    Is it enough to count how many, or do you want the actual combinations as well? – Imran Feb 03 '18 at 22:32
  • 3
    It seems like you can find the unique combinations by just omitting any duplicates from the original list and then generating combinations from the reduced list. E.g. from ABBC, take ABC and generate combinations from that. Perhaps I'm missing something here. – Robert Dodier Feb 03 '18 at 22:36
  • 1
    @Robert no way to get (B,B) from your reduced list. – Imran Feb 03 '18 at 22:37
  • 2
    https://stackoverflow.com/questions/36429507/python-combinations-without-repetitions .. 4th answer down.. uses Counter – johnashu Feb 03 '18 at 22:40
  • sorry the last one.. i counted the question as an answer lol – johnashu Feb 03 '18 at 22:49
  • @Imran yes, you're right. But it's easy to fix up, right? For any duplicated element X, include (X, X) with the combinations generated from the reduced list. I guess it becomes more complicated when r > 2. After thinking about it for a minute, it seems like handling it carefully requires an approach more or less the same as https://stackoverflow.com/a/46623112/871096. – Robert Dodier Feb 03 '18 at 22:56
  • With 3000+ reputation, shouldn't you know better than to disappear and not answer clarification questions? – Stefan Pochmann Feb 03 '18 at 23:17
  • Does your function really what you think it does? Check with `lst="ABCDAB"`... – Darkonaut Feb 04 '18 at 01:54
  • @Imran yes counting is good enough. – sotapme Feb 04 '18 at 13:06
  • @StefanPochmann well I wrote it before going to bed, hoping for smarter people than me to help. Then I woke and had to take my son to football. – sotapme Feb 04 '18 at 13:10
  • @Darkonaut my intention is to get the unique combinations, so I think it does what I want ie. to illustrates for "ABCDAB" there are 20 combinations and 10 unique ones. – sotapme Feb 04 '18 at 13:12
  • Does this need to work for extremely large data sets, or can you just generate all combinations, throw them in a set, and return its count? – Kenny Ostrom Feb 04 '18 at 14:09
  • @KennyOstrom I've tried running it for a large? datasets of > 30 and `itertools.combinations(lst, 15)` takes a very long time to comeback. So doing a `set(combinations(lst, 15))` doesn't scale. – sotapme Feb 04 '18 at 15:15
  • 1
    @johnashu looking at that Counter example may do what I want :O - although I've had to fix it a bit to make it work as it was failing as posted. – sotapme Feb 04 '18 at 15:17
  • 1
    @sotapme What confuses me is that you don't hardcode `r`. What's the reasoning behind calculating `r` dynamically with `len(lst)/2` in the first place?. – Darkonaut Feb 04 '18 at 16:21
  • If you have found am answer please answer yourself as I am also intrigued by your solution.. a handy function for reference – johnashu Feb 04 '18 at 19:51

3 Answers3

7

Here's some Python code based on the generating function approach outlined in this Math Forum article. For each letter appearing in the input we create a polynomial 1 + x + x^2 + ... + x^k, where k is the number of times that the letter appears. We then multiply those polynomials together: the nth coefficient of the resulting polynomial then tells you how many combinations of length n there are.

We'll represent a polynomial simply as a list of its (integer) coefficients, with the first coefficient representing the constant term, the next coefficient representing the coefficient of x, and so on. We'll need to be able to multiply such polynomials, so here's a function for doing so:

def polymul(p, q):
    """
    Multiply two polynomials, represented as lists of coefficients.
    """
    r = [0]*(len(p) + len(q) - 1)
    for i, c in enumerate(p):
        for j, d in enumerate(q):
            r[i+j] += c*d
    return r

With the above in hand, the following function computes the number of combinations:

from collections import Counter
from functools import reduce

def ncombinations(it, k):
    """
    Number of combinations of length *k* of the elements of *it*.
    """
    counts = Counter(it).values()
    prod = reduce(polymul, [[1]*(count+1) for count in counts], [1])
    return prod[k] if k < len(prod) else 0

Testing this on your examples:

>>> ncombinations("abcd", 2)
6
>>> ncombinations("abab", 2)
3
>>> ncombinations("abbb", 2)
2
>>> ncombinations("aaaa", 2)
1

And on some longer examples, demonstrating that this approach is feasible even for long-ish inputs:

>>> ncombinations("abbccc", 3)  # the math forum example
6
>>> ncombinations("supercalifragilisticexpialidocious", 10)
334640
>>> from itertools import combinations  # double check ...
>>> len(set(combinations(sorted("supercalifragilisticexpialidocious"), 10)))
334640
>>> ncombinations("supercalifragilisticexpialidocious", 20)
1223225
>>> ncombinations("supercalifragilisticexpialidocious", 34)
1
>>> ncombinations("supercalifragilisticexpialidocious", 35)
0
>>> from string import printable
>>> ncombinations(printable, 50)  # len(printable)==100
100891344545564193334812497256
>>> from math import factorial
>>> factorial(100)//factorial(50)**2  # double check the result
100891344545564193334812497256
>>> ncombinations("abc"*100, 100)
5151
>>> factorial(102)//factorial(2)//factorial(100)  # double check (bars and stars)
5151
Glorfindel
  • 21,988
  • 13
  • 81
  • 109
Mark Dickinson
  • 29,088
  • 9
  • 83
  • 120
  • Just another double check: `len(set(itertools.combinations(sorted("supercalifragilisticexpialidocious"), 10)))` didn't take long to confirm 334640. And I trust this more than the double checks for extreme cases of no duplicates and endless supply. – Stefan Pochmann Feb 04 '18 at 21:36
  • 1223225 for k=20 confirmed that way as well (in about 3 minutes), and also all other results for k from 17 to 34. – Stefan Pochmann Feb 04 '18 at 22:09
  • @StefanPochmann: Thanks! I've added the `k=10` double check to the examples. – Mark Dickinson Feb 05 '18 at 08:38
  • @MarkDickinson Is there a way to also get the unique combinations (not only the count)? – fbparis Jul 18 '20 at 18:43
  • @fbparis: A simple dynamic programming approach should let you generate the permutations in lexicographic order, without any repetitions. It's probably worth a separate question (if that question doesn't already exist). – Mark Dickinson Jul 19 '20 at 13:29
  • 1
    @fbparis: See https://stackoverflow.com/q/4250125/270986 and the algorithm in `more-itertools` that's mentioned in an answer to that question. – Mark Dickinson Jul 19 '20 at 13:33
  • @MarkDickinson Thanks! Those seem to be more about permutations. I've added a response here for yielding combinations in a faster way using multi sets because that was the title of this page, but I guess if we only want combinations count your method is still faster! – fbparis Jul 19 '20 at 15:30
5

Start with a regular recursive definition of combinations() but add a test to only recurse when the lead value at that level hasn't been used before:

def uniq_comb(pool, r):
    """ Return an iterator over a all distinct r-length
    combinations taken from a pool of values that
    may contain duplicates.

    Unlike itertools.combinations(), element uniqueness
    is determined by value rather than by position.

    """
    if r:
        seen = set()
        for i, item in enumerate(pool):
            if item not in seen:
                seen.add(item)
                for tail in uniq_comb(pool[i+1:], r-1):
                    yield (item,) + tail
    else:
        yield ()

if __name__ == '__main__':
    from itertools import combinations

    pool = 'ABRACADABRA'
    for r in range(len(pool) + 1):
        assert set(uniq_comb(pool, r)) == set(combinations(pool, r))
        assert dict.fromkeys(uniq_comb(pool, r)) == dict.fromkeys(combinations(pool, r))
Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485
0

It seems that this is called a multiset combination. I've faced the same problem and finally came up rewriting a function from sympy (here).

Instead of passing your iterable to something like itertools.combinations(p, r), you pass collections.Counter(p).most_common() to the following function to directly retrieve distinct combinations. It's a lot faster than filtering all combinations and also memory safe!

def counter_combinations(g, n):
    if sum(v for k, v in g) < n or not n:
        yield []
    else:
        for i, (k, v) in enumerate(g):
            if v >= n:
                yield [k]*n
                v = n - 1
            for v in range(min(n, v), 0, -1):
                for j in counter_combinations(g[i + 1:], n - v):
                    rv = [k]*v + j
                    if len(rv) == n:
                        yield rv

Here is an example:

from collections import Counter

p = Counter('abracadabra').most_common()
print(p)
c = [_ for _ in counter_combinations(p, 4)]
print(c)
print(len(c))

Output:

[('a', 5), ('b', 2), ('r', 2), ('c', 1), ('d', 1)]
[['a', 'a', 'a', 'a'], ['a', 'a', 'a', 'b'], ['a', 'a', 'a', 'r'], ['a', 'a', 'a', 'c'], ['a', 'a', 'a', 'd'], ['a', 'a', 'b', 'b'], ['a', 'a', 'b', 'r'], ['a', 'a', 'b', 'c'], ['a', 'a', 'b', 'd'], ['a', 'a', 'r', 'r'], ['a', 'a', 'r', 'c'], ['a', 'a', 'r', 'd'], ['a', 'a', 'c', 'd'], ['a', 'b', 'b', 'r'], ['a', 'b', 'b', 'c'], ['a', 'b', 'b', 'd'], ['a', 'b', 'r', 'r'], ['a', 'b', 'r', 'c'], ['a', 'b', 'r', 'd'], ['a', 'b', 'c', 'd'], ['a', 'r', 'r', 'c'], ['a', 'r', 'r', 'd'], ['a', 'r', 'c', 'd'], ['b', 'b', 'r', 'r'], ['b', 'b', 'r', 'c'], ['b', 'b', 'r', 'd'], ['b', 'b', 'c', 'd'], ['b', 'r', 'r', 'c'], ['b', 'r', 'r', 'd'], ['b', 'r', 'c', 'd'], ['r', 'r', 'c', 'd']]
31
fbparis
  • 880
  • 1
  • 10
  • 23