Here is a smaller example to visualize with:
```python
import numpy as np

my_chosen_balls = np.array([7, 4, 3])
sample_5_balls = np.array([[5, 5, 5, 6, 6, 4, 1, 1, 1, 8, 2, 3, 2, 8, 8],
                           [1, 9, 1, 3, 4, 8, 5, 4, 7, 2, 8, 6, 5, 6, 4],
                           [7, 3, 6, 9, 8, 3, 6, 9, 3, 1, 6, 5, 3, 1, 7],
                           [8, 4, 3, 2, 9, 5, 3, 8, 4, 6, 9, 2, 6, 5, 9],
                           [3, 2, 8, 5, 1, 9, 2, 5, 8, 4, 5, 1, 7, 4, 6]])
```
There are a couple of ways of doing this. Since you have only a single selection of 15, you can use `np.isin`:

```python
mask = np.isin(sample_5_balls, my_chosen_balls).sum(0) == my_chosen_balls.size
```

If you want the percentage of successes:

```python
np.count_nonzero(mask) / sample_5_balls.shape[1]
```
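For the small example you can check this by eye: only column 8 (zero-based) contains all of 7, 4 and 3, so the mask and the percentage come out as:

```python
print(np.flatnonzero(mask))                              # [8]
print(np.count_nonzero(mask) / sample_5_balls.shape[1])  # 0.0666... (1 hit in 15)
```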
The problem is that you can't easily generate an array like `sample_30_balls` or `sample_5_balls` using tools like `np.random.choice` or `np.random.Generator.choice`. There are some solutions available, like *Numpy random choice, replacement only along one axis*, but they only work for a small number of items.
Instead, you can use sorting and slicing to get what you want, as shown here and here:

```python
sample_30_balls = np.random.rand(50, 100000).argsort(0)[:30, :]
```
You will want to add 1 to the numbers for display, but it will be much easier to go zero-based for the remainder of the answer.
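As an aside (my suggestion, not part of the original approach), the newer `Generator` API can express the same shuffle-and-slice idea via `np.random.Generator.permuted` (NumPy 1.20+), which shuffles each column independently:

```python
rng = np.random.default_rng()
pool = np.tile(np.arange(50)[:, None], (1, 100000))   # every column is 0..49
sample_30_balls = rng.permuted(pool, axis=0)[:30, :]  # shuffle each column, keep 30
```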
If your population size stays at 64 or under, you can use bit twiddling to make everything work much faster. First convert the data to a single array of numbers:
```python
sample_30_bits = (1 << sample_30_balls).sum(axis=0)
```
The shift and the sum are together equivalent to a bitwise-OR reduction:

```python
sample_30_bits = np.bitwise_or.reduce(2**sample_30_balls, axis=0)
```
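To make the encoding concrete, take the three chosen balls from the small example: bits 7, 4 and 3 give 128 + 16 + 8 = 152, whichever formulation you use:

```python
(1 << my_chosen_balls).sum()              # 152 == 0b10011000
np.bitwise_or.reduce(2**my_chosen_balls)  # also 152
```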
A single sample is a single integer with this scheme:

```python
my_chosen_bits = (1 << np.random.rand(50).argsort()[:15]).sum()
```
`np.isin` is now infinitely simpler: it's just bitwise AND (`&`). You can use the fast `bit_count` function I wrote here (copied verbatim):
```python
def bit_count(arr):
    # Make the values type-agnostic (as long as it's integers)
    t = arr.dtype.type
    mask = t(-1)
    s55 = t(0x5555555555555555 & mask)  # Add more digits for 128bit support
    s33 = t(0x3333333333333333 & mask)
    s0F = t(0x0F0F0F0F0F0F0F0F & mask)
    s01 = t(0x0101010101010101 & mask)
    arr = arr - ((arr >> 1) & s55)
    arr = (arr & s33) + ((arr >> 2) & s33)
    arr = (arr + (arr >> 4)) & s0F
    return (arr * s01) >> (8 * (arr.itemsize - 1))
```
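As a quick sanity check (my addition, not part of the answer's pipeline), the encoding of the small example's chosen balls has exactly three set bits; on NumPy 2.0+ you can also compare against the built-in `np.bitwise_count`:

```python
x = np.array([152], dtype=np.int64)  # 0b10011000: balls 7, 4 and 3
print(bit_count(x))                  # [3]
```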
```python
pct = (bit_count(my_chosen_bits & sample_30_bits) == 15).sum() / sample_30_bits.size
```
But there's more: now you can generate a large number of samples, not just for the 30 balls but for the 15 as well. One alternative is to generate an identical number of samples for each and compare them one-to-one:
```python
N = 100000
sample_15_bits = (1 << np.random.rand(50, N).argsort(0)[:15, :]).sum(0)
sample_30_bits = (1 << np.random.rand(50, N).argsort(0)[:30, :]).sum(0)
pct = (bit_count(sample_15_bits & sample_30_bits) == 15).sum() / N
```
Another alternative is to generate potentially different numbers of samples for each quantity and compare all of them against each other. The result requires much more space, so I will show it for smaller inputs:
```python
M = 100
N = 5000
sample_15_bits = (1 << np.random.rand(50, M).argsort(0)[:15, :]).sum(0)
sample_30_bits = (1 << np.random.rand(50, N).argsort(0)[:30, :]).sum(0)
pct = (bit_count(sample_15_bits[:, None] & sample_30_bits) == 15).sum() / (M * N)
```
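The `[:, None]` is what triggers the broadcast: an `(M, 1)` array ANDed with an `(N,)` array yields the full `(M, N)` grid of pairwise overlaps. A quick shape check with the values above:

```python
print(sample_15_bits[:, None].shape)                     # (100, 1)
print(sample_30_bits.shape)                              # (5000,)
print((sample_15_bits[:, None] & sample_30_bits).shape)  # (100, 5000)
```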
If you need to optimize for space (e.g., using truly large sample sizes), keep in mind that all the operations here use ufuncs except `np.random.rand` and `argsort`. You can therefore do most of the work in-place without creating temporary arrays. That will be left as an exercise for the reader.
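For the curious, here is a minimal sketch of what that exercise might look like for the 30-ball samples, assuming the same `N` as before; the key tool is the `out=` parameter that every ufunc accepts:

```python
N = 100000
idx = np.random.rand(50, N).argsort(0)[:30, :]  # rand/argsort still allocate: not ufuncs
np.left_shift(1, idx, out=idx)                  # 1 << idx, reusing idx's own buffer
sample_30_bits = np.bitwise_or.reduce(idx, axis=0)
```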
Also, I recommend that you draw histograms of `bit_count(sample_15_bits & sample_30_bits)` to adjust your expectations. Here is a histogram of the counts for the last example above:
```python
import matplotlib.pyplot as plt

y = np.bincount(bit_count(sample_15_bits[:, None] & sample_30_bits).ravel())
x = np.arange(y.size)
plt.bar(x, y)
```

Notice how tiny the bar at 15 is. I've seen values of `pct` around 7e-5 while writing this answer, which lines up with the theoretical value of C(30, 15) / C(50, 15) ≈ 6.9e-5.
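That theoretical value is just the chance that a random 15-subset of the 50 balls lands entirely inside a given 30-subset; `math.comb` confirms it:

```python
from math import comb
print(comb(30, 15) / comb(50, 15))  # ~6.89e-05
```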