Multiple sequences of random numbers without replacement

Question

In numpy, I can use the code

from numpy.random import default_rng
rng = default_rng()
M, N, n = 10000, 1000, 3
rng.choice(np.arange(0, N), size=n, replace=False)

To get three random samples from 0 to 9 without replacement.

I would like to get thousands of such random sequences. What is the correct way to do this? I know I can do

np.array([rng.choice(np.arange(0, N), size=(n,), replace=False) for i in range(0, M)])

but I am wondering if there's a more efficient way to do this using numpy.

In this answer, the following way is recommended

np.argsort(rng.random((M,N)),axis=1)[:, :n]

which is superfast and elegant. However, the cost scales like N x M instead of n x M which I am hoping to achieve.

Are there any other methods out there?

When you speak of cost, are you worried about the memory or performance? — Divakar, Jun 29 '20 at 16:19
I am worried about both but mostly the computational performance (I can reduce memory by generating the random variables in batches). — Tohiko, Jun 29 '20 at 16:21
As suggested [`here`](https://stackoverflow.com/a/37012691/), you can use `np.random.rand(M,N).argpartition(n)[:,:n]`. With `argpartition` you would gain performance efficiency replacing `argsort`. Would that work for you? — Divakar, Jun 29 '20 at 16:24
Mm.. That still would scale like `N x M`. Is there no way to have something that does not scale this badly with the datum `N`? — Tohiko, Jun 29 '20 at 16:34

Divakar · Accepted Answer · 2020-06-30T10:46:47.530

Approach #1

For N >> n, we can use an iterative method with masking, so that at each iteration we pick one not-previously picked element per row. The implementation would look something like this -

R = np.arange(M)
mask = np.ones((M,N), dtype=bool)
idx = np.random.randint(0,N,(M))
mask[R,idx] = 0

for i in range(1,n):
    lim = N-i
    m2 = np.ones((M,lim), dtype=bool)
    idx2 = np.random.randint(0,lim,(M))
    m2[R,idx2] = 0
    mask[mask] = m2.ravel()

out = np.nonzero(~mask)[1].reshape(-1,n)

If you need to randomize numbers per row, use the rand-trick as linked in question post :

out = np.take_along_axis(out, np.random.rand(M,n).argsort(1), axis=1)

If the constant array-creation with m2 bothers you, re-use after initializing before looping, while keeping the rest of the code same -

m2 = np.ones((M,N-1), dtype=bool)
for i in range(1,n):
    lim = N-i
    idx2 = np.random.randint(0,lim,(M))
    m2[R,idx2] = 0
    mask[mask] = m2.ravel()
    m2[R,idx2] = 1
    m2 = m2[:,:-1]

Approach #2 Similar to Approach #1, but the initialization part does most of the job to setup unqiue random numbers per row. An additional while iterative part takes care of the rows that could not assign unique ones. With N >> n, we will hardly need to iterate though. The implementation would look something like this -

# https://stackoverflow.com/a/51915131/ @Divakar
def random_num_per_grp(L):
    # For each element in L pick a random number within range specified by it
    r1 = np.random.rand(np.sum(L)) + np.repeat(np.arange(len(L)),L)
    offset = np.r_[0,np.cumsum(L[:-1])]
    return r1.argsort()[offset] - offset

R = np.arange(M)
mask = np.ones((M,N), dtype=bool)
idx = np.random.randint(0,N,(M,n))
mask[R[:,None],idx] = 0

rows_notdone = mask.sum(1)!=N-n
while np.any(rows_notdone):    
    idx0 = random_num_per_grp(mask[rows_notdone].sum(1))
    steps = np.r_[0,mask.sum(1).cumsum()[:-1]]
    flat_idx0 = steps[rows_notdone] + idx0
    
    m2 = np.ones(mask.sum(), dtype=bool)
    m2[flat_idx0] = 0
    mask[mask] = m2
    
    rows_notdone = mask.sum(1)!=N-n

out = np.nonzero(~mask)[1].reshape(-1,n)

Thanks. I am marking this as the answer since it reduces the cost significantly. However, the cost scaling is still ‘MxN’ since the mask array has that size and around ‘M x (N-n)^2’ random variables are generated making it better used when ‘n’ is close to ‘N’. — Tohiko, Jul 02 '20 at 06:52
@Tohiko For the mask it's actually `MxN/8`, since its a boolean array, so has a much lesser memory footprint. How did you come up with ‘M x (N-n)^2’ random variables being used? As I said, for approach #2, you hardly need to go inside while loop. So, essentially apart from the mask, you have `np.random.randint(0,N,(M,n))` and `np.nonzero` part that generates again `Mxn` array. So, a total of mask + `Mx2*n`. That's all, essentially for the memory footprint. — Divakar, Jul 02 '20 at 07:06
Ah, sorry. I was referring to Approach #1. As far as I can see, Approach #2 has a similar average case complexity (worst case being infinite). — Tohiko, Jul 02 '20 at 09:55

Multiple sequences of random numbers without replacement

1 Answers1