How to extend the iteration process into large size in Python 3

Question

After tossed a coin n times (n=10), there are 2^10=1024 possible outcomes. I used

lst = [list(i) for i in itertools.product([0, 1], repeat=n)]

to obtain n=10 all possible outcomes, and then I want to find out the number of group (m) which is defined as a maximum consecutive sequence for identical side of coin. For example, HHHTTHTTHT, the number of groups for this outcome is 6, which are (HHH)(TT)(H)(TT)(H)(T). I employed

group=[len([len(list(grp)) for k, grp in groupby(x)]) for x in lst]

to find out the number of groups (m) for each corresponding combination of 10 tossed coin. Finally, I obtain the number of possible combinations whose the number of groups is greater than 6 (m), such as,

Group6=len(list(filter(lambda x:x>m,group)))

However, when the number of tosses increases, for instance, (n=200,m=110) or (n=500, m=260). I still used the same code as the above in Python, but it time-consumed and I think it exceeded the memory size of python. Could someone please help me to figure out how to address this issue, when n and m are quite large? Thanks

For starters, you can use a generator expression instead of a list-comprehension for `group` and `lst`. You shouldn't be materializing the iterators into lists unless you need those values after. — juanpa.arrivillaga, Feb 14 '17 at 18:06
There is no need to **actually generate the tosses**: arithmetic and mathematical analysis reduces this to a combinatorial problem. — Willem Van Onsem, Feb 14 '17 at 18:10
However, the above will only help your memory usage, essentially, you are going to encounter combinatorial explosionsince you are brute-forcing your solution... so yeah. Use math? — juanpa.arrivillaga, Feb 14 '17 at 18:10
Thanks for your suggestion. I need to obtain the number of groups greater than a specific value, such as 110 for 200 tosses or 260 for 500 tosses, and the number of heads for each corresponding 2^200 combination and 2^500 combination, respectively. — TWord, Feb 14 '17 at 18:13
is a waste of memory and time building a list to just get its size, use instead `sum` like `sum(1 for _ in iterator)` — Copperfield, Feb 14 '17 at 18:16
I know it is a binomial trial for the number of heads (e.g. success outcome) in n times experiment. However, how to use math for the number of groups (which defined as a maximum consecutive sequence for identical side of coin for each of n times experiment) in such binomial trial — TWord, Feb 14 '17 at 18:20
Using a brute-force approach will not work when you want to iterate over 2^200 possible outcomes. Thats a number with 60 digits. If each combination could be processed in 1 nanosecond, it would still take something like 5x10^43 years, which is over 3x10^33 **times the age of the universe** — juanpa.arrivillaga, Feb 14 '17 at 18:21
you should ask in http://math.stackexchange.com/ instead, a mathematician may help you more than a programmer — Copperfield, Feb 14 '17 at 18:25
Yes, definitely, it is waste time to employ the method of a list-comprehension for iteration process for 2^200 or 2^500 possible outcomes. I only want to know the number of groups and corresponding heads greater than some specific values in those substantial large size of possible combinations — TWord, Feb 14 '17 at 18:29

score 1 · Answer 1 · edited May 23 '17 at 11:45

Iterating over all possible tosses is definitely not going to work for large n (and here n is already large from 20 I guess). One can only calculate this efficiently using some combinatorics.

Let's first start with a simple example: minimum two groups for four tosses: if we want to calculate the number of tosses that result in two our more groups. There are a few possibilities (we number the groups like 1, 2, etc.):

Two groups:

1112
1122
1222

Three groups:

1123
1223
1233

Four groups:

We know that the value of group G1 will be different then the one for G2 and the one for G2 different from G3 and so on. So it holds that:

G1 != G2 != G3 != ... != Gn

Since there are only two values (heads and tails) this means that if we determine the first group, we have deterimed the value for all groups. If G1 is heads, all even groups are tails and all odd groups are heads.

So this means that for every combination above, there are exactly two configurations. This means that for the case n=4,m=1, there are 2×7=14 configurations (which is exactly what we get with get when we work it out with the program in your question).

Now the only problem we still have to face is how we are going to count these super-configurations. We can do this by introducing what I would call upgrade notation:

You notate an increase of the group with 1 and the same group with 0.

So now 1223 becomes 0101: we upgrade the group on the second and the fourth index. And 1234 is 0111. How does this help? Well in fact for k groups, we only have to count the number of combinations with k-1 upgrades. So this means the number of combinations (k-1 nCr n-1). With n the length of the string (or the total number of tosses), and k the number of groups*. Now (k-1 nCr n) is defined as: (n-1)!/((k-1)!*(n-k)!). With ! the factorial. I borrowed a way to calculate nCr from here:

import operator as op from functools import reduce

def ncr(n, r):
    r = min(r, n-r)
    if r == 0: return 1
    return reduce(op.mul,range(n-r+1,n+1))//reduce(op.mul,range(1,r+1))

And now the last step is only to calculate the number of combinations (times two) for every k higher than m up to n. So:

def group_num(n,m):
    total = 0
    for i in range(m+1,n+1):
        total += ncr(n-1,i-1)
    return 2*total

or putting it into a one-liner:

def group_num(n,m):
    return 2*sum(ncr(n-1,i-1) for i in range(m+1,n+1))

Now we can test our code:

>>> group_num(4,1)
14
>>> group_num(10,6)
260
>>> group_num(200,110)
125409844583698900384745448848066934911164089598228258053888
>>> group_num(500,260)
606609270097725645141493459934317664675833205307408583743573981944035656294544972476175251556943050783856517254296239386495518290119646417132819099088

Which are the expected numbers (for the first two). As you can see the total amount blows up enormously so even the fastest algorithm to count the number of tosses one at a time will easily be outperformed by this approach (the results were calculated in less than a second).

The ncr function can be calculated in O(n) (but this can even be improved by using a memoize on the factorial for instance); and the group_num function is thus computed (without memoize optimization) in O((n-m)×n). Which thus completely outfactors the exponential behavior.

How to extend the iteration process into large size in Python 3

1 Answers1