Statistics: Optimizing probability calculations within python

Question

Setup:

The question is complex form of a classic probability question:

70 colored balls are placed in an urn, 10 for each of the seven rainbow colors.

What is the expected number of distinct colors in 20 randomly picked balls?

My solution is python's itertools library: combos = itertools.combinations(urn, 20), print sum([1 for x in combos]) (where urn is a list of the 70 balls in the urn).

I can unpack the iterator up to a length of combinations(urn, 8) past that my computer can't handle it.

Note: I know this wouldn't give me the answer, this is only the road block in my script, in other words if this worked my script would work.

Question: How could I find the expected colors accurately, without the worlds fastest super computer? Is my way even computationally possible?

https://projecteuler.net/problem=493 ? If so we could add the `[project-euler]` tag. (Brute force tends not to work for the higher PE questions - you need a mathematical/combinatorial solution). — Alex Riley, Jan 12 '15 at 13:58
The only reason I didn't at the tag was due to the tag description: "**DO NOT USE THIS TAG** Project Euler is a collection of mathematical programming problems of varying difficulty." — PVNRT, Jan 12 '15 at 14:06
Ah - that wasn't there last time I checked... I guess you were right to not use it. — Alex Riley, Jan 12 '15 at 14:07
There is a 1-line mathematical solution based on rewriting the count as a sum, then using the linearity of expectation. — Douglas Zare, Jan 12 '15 at 15:48

score 14 · Accepted Answer · edited May 23 '17 at 12:10

Since a couple of people have asked to see the mathematical solution, I'll give it. This is one of the Project Euler problems that can be done in a reasonable amount of time with pencil and paper. The answer is

7(1 - (60 choose 20)/(70 choose 20))

To get this write X, the count of colors present, as a sum X0+X1+X2+...+X6, where Xi is 1 if the ith color is present, and 0 if it is not present.

E(X) 
= E(X0+X1+...+X6) 
= E(X0) + E(X1) + ... + E(X6)        by linearity of expectation
= 7E(X0)                             by symmetry
= 7 * probability that a particular color is present
= 7 * (1- probability that a particular color is absent)
= 7 * (1 - (# ways to pick 20 avoiding a color)/(# ways to pick 20))
= 7 * (1 - (60 choose 20)/(70 choose 20))

Expectation is always linear. So, when you are asked to find the average value of some random quantity, it often helps to try to rewrite the quantity as a sum of simpler pieces such as indicator (0-1) random variables.

This does not say how to make the OP's approach work. Although there is a direct mathematical solution, it is good to know how to iterate through the cases in an organized and practicable fashion. This could help if you next wanted a more complicated function of the set of colors present than the count. Duffymo's answer suggested something that I'll make more explicit:

You can break up the ways to draw 20 calls from 70 into categories indexed by the counts of colors. For example, the index (5,5,10,0,0,0,0) means we drew 5 of the first color, 5 of the second color, 10 of the third color, and none of the other colors.

The set of possible indices is contained in the collection of 7-tuples of nonnegative integers with sum 20. Some of these are impossible, such as (11,9,0,0,0,0,0) by the problem's assumption that there are only 10 balls of each color, but we can deal with that. The set of 7-tuples of nonnegative numbers adding up to 20 has size (26 choose 6)=230230, and it has a natural correspondence with the ways of choosing 6 dividers among 26 spaces for dividers or objects. So, if you have a way to iterate through the 6 element subsets of a 26 element set, you can convert these to iterate through all indices.

You still have to weight the cases by the counts of the ways to draw 20 balls from 70 to get that case. The weight of (a0,a1,a2,...,a6) is (10 choose a0)(10 choose a1)...*(10 choose a6). This handles the case of impossible indices gracefully, since 10 choose 11 is 0 so the product is 0.

So, if you didn't know about the mathematical solution by the linearity of expectation, you could iterate through 230230 cases and compute a weighted average of the number of nonzero coordinates of the index vector, weighted by a product of small binomial terms.

Thank you for this clear solution Douglas. Your original comment prompted me to solve the question with a spreadsheet, so I don't feel bad about reading your solution. — PVNRT, Jan 14 '15 at 14:40

score 1 · Answer 2 · answered Jan 12 '15 at 17:34

1

Wouldn't it just be combinations with repetition?

http://www.mathsisfun.com/combinatorics/combinations-permutations.html

answered Jan 12 '15 at 17:34

duffymo

305,152
44
369
561

No, I believe you are talking about the stars-and-bars method for counting the number of ways to choose, say, 20 unordered objects in 7 categories. However, these are not all given equal probability when you take 20 objects from an urn with 10 of each color. You can't get 11 of one color, and there is only one way to get 10 of the first two colors, but (10 choose 5)*(10 choose 5) ways to get 10 of the first color, then 5 of the second and third colors. – Douglas Zare Jan 13 '15 at 05:49
I don't know what stars and bars method is. The link gives the formula I have in mind. – duffymo Jan 13 '15 at 10:09
The stars and bars method is the standard name for what is called "combinations with repetitions" in the link you gave. http://en.wikipedia.org/wiki/Stars_and_bars_%28combinatorics%29 I don't know how you are going from that to the probability requested in this problem. Which formula in your link do you mean? The count for the number of ways to divide 20 objects among 7 categories is not relevant. – Douglas Zare Jan 13 '15 at 10:44
Maybe not the perfect answer, but it ought to be applicable as a bound. I realize that there are only 10 of each color, so the user is drawing from a bag without replacement. I think there's value in doing something simpler and then refining. – duffymo Jan 13 '15 at 13:27
It's not a bound because there are errors in both directions. What you can do is iterate over the 26 choose 6 cases, weighting them by some probability that will be 0 for the impossible cases. The computation of that probability is more complicated than the original problem, but perhaps this iteration is what the OP wants. – Douglas Zare Jan 13 '15 at 14:08
Maybe you're right. You've got some good things to say in comments to all the answers. I'm wondering why you don't submit a working answer yourself? You seem to know more about this than either of the folks that have attempted a solution. – duffymo Jan 13 '15 at 14:09
If this were a math question I could answer it, but that wouldn't answer the programming problem, and I haven't used Python in years. – Douglas Zare Jan 13 '15 at 15:06
"There is a 1-line mathematical solution based on rewriting the count as a sum, then using the linearity of expectation." - I'd like to see you write it in any language you were fluent in. – duffymo Jan 13 '15 at 15:59
@DouglasZare I too would like to see the 1-line solution. The best I can do is to work with integer partitions of 20. This reduces the combinatorics to something manageable, and it is exact, but it is not an explicit solution. – Robert Dodier Jan 13 '15 at 20:53
Not clear why it's linear. How do you know that? What if it's not? – duffymo Jan 13 '15 at 20:54

score -2 · Answer 3 · answered Jan 12 '15 at 14:53

-2

Make an urn with 10 of each color.
Decide on the number of trials you want.
Make a container to hold the result of each trial
for each trial, pick a random sample of twenty items from the urn, make a set of those items, add the length of that set to the results.
find the average of the results

answered Jan 12 '15 at 14:53

wwii

23,232
7
37
77

1

How many random trials would you recommend to get the required accuracy of 10^-9? It looks like about 10^18 trials to me, so I think another method should be used. – Douglas Zare Jan 12 '15 at 15:53
@DouglasZare I didn't see the accuracy requirement in the question - that's a bit overkill. – wwii Jan 13 '15 at 05:00
@DouglasZare Just out of curiosity, how did you compute the number of trials required to reach the given accuracy? – rubik Feb 27 '15 at 20:52
@rubik: You expect that to get an accuracy of 1/n, it will take about cn^2 trials for some c. I estimated the value of c (as roughly 1) by making the simplifying assumption that the indicators variables (for the presence of each color) are independent. The variance of a sum of independent random variables is the sum of the variances, so this independence assumption makes the calculation of the variance of the count simple, 7p(1-p) where p is .974, the probability that each color is included. The standard deviation is the square root of 7p(1-p), 0.42, close enough to 1/2 or 1. – Douglas Zare Feb 27 '15 at 21:12
You can bound how important the error is from the independence assumption on the calculation of the standard deviation by comparing with a sum of indicators that are as anti-correlated as possible. It doesn't make much of a difference. We assumed the correlation was 0, the actual correlation is a small negative number, and making this as negative as possible produces another easily analyzed random variable whose standard deviation is 0.43 instead. – Douglas Zare Feb 27 '15 at 21:16
@DouglasZare Wow thank you! That's really informative. – rubik Feb 27 '15 at 21:18

Statistics: Optimizing probability calculations within python

3 Answers3