A and B are 2 samples of size 50 created from a pool of 100 observations (with replacement). What is the probability that A and B would not have more than 10 observations in common? Or, how many such samples could be created so that A and B have at most 10 observations in common? E.g. obs. 1,...,50 in A and 51,...,100 in B, or 1,3,5,...,99 in A and 2,4,...,100 in B. Also, is it possible to repeat the same exercise for 8 samples of size 50 instead of only 2 samples? How do I calculate that in Excel or R?
seems like this question is better suited @https://stats.stackexchange.com/ – p._phidot_ Jun 04 '21 at 11:39
1 Answer
Since you asked this question here and not on Mathematics Stack Exchange, I assume that you want to estimate the above probabilities using simulation rather than calculating them directly.
What is the probability that A and B would not have more than 10 observations in common?
In R, we can estimate this probability by drawing 2 samples, each with 50 elements, checking whether they have at most 10 numbers in common, and repeating this process while counting the iterations that meet this condition. Using 1,000,000 iterations, the probability is estimated as follows:
# total iterations
n = 1000000
# Increases when samples don't share more than 10 observations
count = 0
# population
pop = 1:100
# Loop for checking the condition n times
for(i in 1:n){
    # obtain 2 samples each of size 50 (It is assumed that values in
    # each sample can not repeat and that there is replacement after each sample is obtained)
    sam1 = sample(pop, 50)
    sam2 = sample(pop, 50)
    # Count values found in both samples
    # Takes advantage of the fact that TRUE values can be used as 1s in R
    total = sum(sam1 %in% sam2)
    # Increase counter if there are 10 or less matches
    if(total <= 10) {
        count = count + 1
    }
}
# Print the probability
print(count/n)
For your question about the 8 samples, you can repeat the above code with 8 samples instead of 2; to find the intersections, do as described here.
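As a rough sketch of the 8-sample case: the question does not say what "in common" means for more than two samples, so the code below assumes two possible readings (observations present in all 8 samples, and every pair of samples sharing at most 10). The variable names, the seed, and the smaller iteration count are illustrative choices, not part of the original code.

```r
# Sketch only. ASSUMPTION: "in common" is read in two ways, see below.
set.seed(1)      # hypothetical seed, only for reproducibility
n_iter <- 2000   # kept small so the pairwise check runs quickly
pop <- 1:100
k <- 8

common_all <- logical(n_iter)    # TRUE when at most 10 values are in all k samples
common_pairs <- logical(n_iter)  # TRUE when every pair shares at most 10 values

for (i in seq_len(n_iter)) {
  # k samples of size 50, each drawn without repeats within itself
  sams <- replicate(k, sample(pop, 50), simplify = FALSE)

  # Reading 1: values present in all k samples
  common_all[i] <- length(Reduce(intersect, sams)) <= 10

  # Reading 2: largest overlap among all choose(k, 2) pairs
  pair_overlap <- combn(k, 2, function(ix) {
    length(intersect(sams[[ix[1]]], sams[[ix[2]]]))
  })
  common_pairs[i] <- max(pair_overlap) <= 10
}

p_all <- mean(common_all)
p_pairs <- mean(common_pairs)
print(c(all_k = p_all, every_pair = p_pairs))
```

Under the first reading the estimate should come out close to 1, since only about 100 * 0.5^8 values are expected to appear in all eight samples. Under the second reading it will most likely print 0 at any practical number of iterations.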
Note: since there are a lot of possible sample pairs (choose(100, 50)^2) and sets of eight samples (choose(100, 50)^8), these probabilities may be so small that simulation would require a very large number of iterations in order to observe even one set of samples that meets the problem's criteria.
Calculating the probabilities directly
I suggest you ask your question on Mathematics Stack Exchange. I found this somewhat useful post there about a similar problem.
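For the two-sample case, R can also do a direct calculation under the same assumption as the simulation (each sample holds 50 distinct values, and A and B are drawn independently). With A fixed, the number of elements of B that fall in A follows a hypergeometric distribution, so `phyper` gives the probability without simulating. This sketch is not taken from the linked post, and the variable names are illustrative.

```r
# ASSUMPTION: each sample is 50 distinct values from a pool of 100,
# and the two samples are drawn independently of each other.
pool_size <- 100
sample_size <- 50
max_common <- 10

# phyper(q, m, n, k) = P(X <= q), where m = marked items (those in A),
# n = unmarked items, k = number of items drawn (the size of B)
p_exact <- phyper(max_common, sample_size, pool_size - sample_size, sample_size)

# The same probability from explicit counting, as a cross-check
k <- 0:max_common
ways_b <- sum(choose(sample_size, k) *
              choose(pool_size - sample_size, sample_size - k))
p_count <- ways_b / choose(pool_size, sample_size)

# Number of ordered pairs (A, B) with at most max_common values in common
# (a floating-point approximation, not an exact integer)
n_pairs <- choose(pool_size, sample_size) * ways_b

print(p_exact)
print(n_pairs)
```

If these assumptions match your setup, the probability should be on the order of 1e-9, which would explain the simulation printing 0 even with 1,000,000 iterations. The 8-sample case needs a precise definition of "in common" before a similar formula can be written down.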
