
I'm playing a bingo game and I choose 15 balls out of 50 possible balls (without replacement).

Then 30 balls get drawn (without replacement), and if my 15 balls are in this set of 30 balls drawn, I get a prize.

I wanted to calculate the probability of winning a prize by simulating this many times, preferably vectorized.

So here's my code until now:

import numpy as np

my_chosen_balls = np.random.choice(range(1,51), 15)
samples_30_balls = np.random.choice(range(1,51), (1_000_000, 30))

How do I compare the 15 balls that I chose with each of these 30-ball samples and see if all my balls were picked?
In other words, compare my 15 balls to every one of the samples separately.

Sander van den Oord

2 Answers


Here is a smaller example to visualize with: 3 chosen balls and 15 samples of 5 balls each, one sample per column:

my_chosen_balls = np.array([7, 4, 3])
sample_5_balls = np.array([[5, 5, 5, 6, 6, 4, 1, 1, 1, 8, 2, 3, 2, 8, 8],
                           [1, 9, 1, 3, 4, 8, 5, 4, 7, 2, 8, 6, 5, 6, 4],
                           [7, 3, 6, 9, 8, 3, 6, 9, 3, 1, 6, 5, 3, 1, 7],
                           [8, 4, 3, 2, 9, 5, 3, 8, 4, 6, 9, 2, 6, 5, 9],
                           [3, 2, 8, 5, 1, 9, 2, 5, 8, 4, 5, 1, 7, 4, 6]])

There are a couple of ways of doing this. Since you have only a single selection of 15, you can use np.isin:

mask = np.isin(sample_5_balls, my_chosen_balls).sum(0) == my_chosen_balls.size

If you want the percentage of successes:

np.count_nonzero(mask) / sample_5_balls.shape[1]
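
If I trace the toy data correctly, only the ninth column (index 8, holding 1, 7, 3, 4, 8) contains all of 7, 4 and 3, so mask has a single True and the estimate comes out to 1/15:

np.count_nonzero(mask)                               # 1
np.count_nonzero(mask) / sample_5_balls.shape[1]     # 0.0666... = 1/15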

The problem is that you can't easily generate an array like samples_30_balls or sample_5_balls (independent samples, each drawn without replacement along one axis) using tools like np.random.choice or np.random.Generator.choice. There are some solutions available, like Numpy random choice, replacement only along one axis, but they only work for a small number of items.

Instead, you can use sorting and slicing to get what you want, as shown here and here:

sample_30_balls = np.random.rand(50, 100000).argsort(0)[:30, :]

You will want to add 1 to the numbers for display, but it will be much easier to go zero-based for the remainder of the answer.

If your population size stays at 64 or under, you can use bit twiddling to make everything work much faster. First convert the data to a single array of numbers:

sample_30_bits = (1 << sample_30_balls).sum(axis=0)

The shift and the sum together are equivalent to

sample_30_bits = np.bitwise_or.reduce((2**sample_30_balls), axis=0)

A single sample is a single integer with this scheme:

my_chosen_bits = (1 << np.random.rand(50).argsort()[:15]).sum()
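
To illustrate the encoding (my own toy example, not part of the answer's data): a sample containing balls 0, 2 and 5 becomes the integer with bits 0, 2 and 5 set, and membership tests reduce to bitwise AND:

tiny_sample = np.array([0, 2, 5])
tiny_bits = (1 << tiny_sample).sum()   # 1 + 4 + 32 = 37 = 0b100101
bool(tiny_bits & (1 << 2))             # True: ball 2 is in the sample
bool(tiny_bits & (1 << 3))             # False: ball 3 is not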

np.isin is now infinitely simpler: it's just bitwise AND (&). You can use the fast bit_count function I wrote here (copied verbatim):

def bit_count(arr):
    # Make the values type-agnostic (as long as it's integers)
    t = arr.dtype.type
    mask = t(-1)
    s55 = t(0x5555555555555555 & mask)  # Add more digits for 128bit support
    s33 = t(0x3333333333333333 & mask)
    s0F = t(0x0F0F0F0F0F0F0F0F & mask)
    s01 = t(0x0101010101010101 & mask)

    arr = arr - ((arr >> 1) & s55)
    arr = (arr & s33) + ((arr >> 2) & s33)
    arr = (arr + (arr >> 4)) & s0F
    return (arr * s01) >> (8 * (arr.itemsize - 1))

pct = (bit_count(my_chosen_bits & sample_30_bits) == 15).sum() / sample_30_bits.size
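
As a quick sanity check (my own example), bit_count returns the number of set bits per element:

bit_count(np.array([0b1011, 0b1111111111]))   # array([ 3, 10])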

But there's more: now you can generate a large number of samples not just for the 30 balls, but for the 15 as well. One alternative is to generate identical numbers of samples, and compare them 1-to-1:

N = 100000
sample_15_bits = (1 << np.random.rand(50, N).argsort(0)[:15, :]).sum(0)
sample_30_bits = (1 << np.random.rand(50, N).argsort(0)[:30, :]).sum(0)
pct = (bit_count(sample_15_bits & sample_30_bits) == 15).sum() / N

Another alternative is to generate potentially different arrays of samples for each quantity, and compare all of them against each other. This will require a lot more space in the result, so I will show it for smaller inputs:

M = 100
N = 5000
sample_15_bits = (1 << np.random.rand(50, M).argsort(0)[:15, :]).sum(0)
sample_30_bits = (1 << np.random.rand(50, N).argsort(0)[:30, :]).sum(0)
pct = (bit_count(sample_15_bits[:, None] & sample_30_bits) == 15).sum() / (M * N)

If you need to optimize for space (e.g., using truly large sample sizes), keep in mind that all the operations here use ufuncs except np.random.rand and argsort. You can therefore do most of the work in-place without creating temporary arrays. That will be left as an exercise for the reader.
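
For instance, a rough sketch of that idea for the equal-size case above (my own, and only a starting point): reuse a preallocated buffer for the AND step via the ufunc out= argument:

buf = np.empty_like(sample_30_bits)
np.bitwise_and(sample_15_bits, sample_30_bits, out=buf)    # no extra array for the AND
pct = np.count_nonzero(bit_count(buf) == 15) / buf.size    # bit_count itself still allocates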

Also, I recommend that you draw histograms of bit_count(sample_15_bits & sample_30_bits) to adjust your expectations. Here is a histogram of the counts for the last example above:

import matplotlib.pyplot as plt

y = np.bincount(bit_count(sample_15_bits[:, None] & sample_30_bits).ravel())
x = np.arange(y.size)
plt.bar(x, y)

[Histogram of the overlap counts bit_count(sample_15_bits[:, None] & sample_30_bits)]

Notice how tiny the bar at 15 is. I've seen values of pct around 7e-5 while writing this answer, but am too lazy to figure out the theoretical value.
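
For what it's worth, the theoretical value works out as follows: a winning draw must include your 15 balls and fill the remaining 15 of the 30 slots from the other 35 balls, so the probability is C(35, 15) / C(50, 30):

from math import comb
comb(35, 15) / comb(50, 30)   # ~6.9e-05, consistent with the ~7e-5 observed above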

Mad Physicist

With np.isin, count the intersecting values and compare the count with 15. I changed the data generation to sample without replacement.

import numpy as np
np.random.seed(10)

my_chosen_balls = np.random.choice(range(0,50), 15, replace=False)
samples_30_balls = np.random.rand(1_000_000,50).argsort(1)[:,:30]

(np.isin(samples_30_balls, my_chosen_balls).sum(1) == 15).sum()

Output

74

So roughly a 0.0074% chance (74 out of 1,000,000).


How generating a data sample without replacement works

Generate random values in [0,1) with shape (number of samples, population size). Here: 10 samples from the population [0, 1, 2, 3, 4].

np.random.rand(10,5)

Out

array([[0.37216438, 0.16884495, 0.05393551, 0.68189535, 0.30378455],
       [0.63428637, 0.6566772 , 0.16162259, 0.16176099, 0.74568611],
       [0.81452942, 0.10470267, 0.89547322, 0.60099124, 0.22604322],
       [0.16562083, 0.89936513, 0.89291548, 0.95578207, 0.90790727],
       [0.11326867, 0.18230934, 0.44912596, 0.65437732, 0.78308136],
       [0.72693801, 0.22425798, 0.78157525, 0.93485338, 0.84097546],
       [0.96751432, 0.57735756, 0.48147214, 0.22441829, 0.53388467],
       [0.95415338, 0.07746658, 0.93875458, 0.21384035, 0.26350969],
       [0.39937711, 0.35182801, 0.74707871, 0.07335893, 0.27553172],
       [0.80749372, 0.40559599, 0.33654045, 0.14802479, 0.71198915]])

'Convert' to integers with argsort; each row becomes a permutation of the indices 0-4

np.random.rand(10,5).argsort(1)

Out

array([[4, 2, 1, 0, 3],
       [0, 1, 3, 2, 4],
       [1, 3, 2, 4, 0],
       [4, 0, 2, 3, 1],
       [2, 3, 0, 1, 4],
       [1, 4, 3, 2, 0],
       [4, 3, 2, 0, 1],
       [1, 0, 2, 3, 4],
       [4, 1, 2, 3, 0],
       [1, 4, 0, 2, 3]])

Slice to the desired sample size

np.random.rand(10,5).argsort(1)[:,:3]

Out

array([[2, 3, 4],
       [0, 4, 3],
       [3, 0, 4],
       [2, 0, 3],
       [2, 3, 4],
       [3, 4, 2],
       [2, 0, 1],
       [0, 4, 3],
       [0, 2, 3],
       [2, 3, 4]])
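
A quick check (my own addition) that each row really is a sample without replacement, i.e. contains no duplicate values:

samples = np.random.rand(10, 5).argsort(1)[:, :3]
assert all(np.unique(row).size == row.size for row in samples)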
Michael Szczesny