1

In numpy, I can use the code

from numpy.random import default_rng
rng = default_rng()
M, N, n = 10000, 1000, 3
rng.choice(np.arange(0, N), size=n, replace=False)

To get three random samples from 0 to 9 without replacement.

I would like to get thousands of such random sequences. What is the correct way to do this? I know I can do

np.array([rng.choice(np.arange(0, N), size=(n,), replace=False) for i in range(0, M)])

but I am wondering if there's a more efficient way to do this using numpy.

In this answer, the following way is recommended

np.argsort(rng.random((M,N)),axis=1)[:, :n]

which is superfast and elegant. However, the cost scales like N x M instead of n x M which I am hoping to achieve.

Are there any other methods out there?

Tohiko
  • 1,860
  • 2
  • 18
  • 26
  • When you speak of cost, are you worried about the memory or performance? – Divakar Jun 29 '20 at 16:19
  • I am worried about both but mostly the computational performance (I can reduce memory by generating the random variables in batches). – Tohiko Jun 29 '20 at 16:21
  • As suggested [`here`](https://stackoverflow.com/a/37012691/), you can use `np.random.rand(M,N).argpartition(n)[:,:n]`. With `argpartition` you would gain performance efficiency replacing `argsort`. Would that work for you? – Divakar Jun 29 '20 at 16:24
  • Mm.. That still would scale like `N x M`. Is there no way to have something that does not scale this badly with the datum `N`? – Tohiko Jun 29 '20 at 16:34

1 Answers1

1

Approach #1

For N >> n, we can use an iterative method with masking, so that at each iteration we pick one not-previously picked element per row. The implementation would look something like this -

R = np.arange(M)
mask = np.ones((M,N), dtype=bool)
idx = np.random.randint(0,N,(M))
mask[R,idx] = 0

for i in range(1,n):
    lim = N-i
    m2 = np.ones((M,lim), dtype=bool)
    idx2 = np.random.randint(0,lim,(M))
    m2[R,idx2] = 0
    mask[mask] = m2.ravel()

out = np.nonzero(~mask)[1].reshape(-1,n)

If you need to randomize numbers per row, use the rand-trick as linked in question post :

out = np.take_along_axis(out, np.random.rand(M,n).argsort(1), axis=1)

If the constant array-creation with m2 bothers you, re-use after initializing before looping, while keeping the rest of the code same -

m2 = np.ones((M,N-1), dtype=bool)
for i in range(1,n):
    lim = N-i
    idx2 = np.random.randint(0,lim,(M))
    m2[R,idx2] = 0
    mask[mask] = m2.ravel()
    m2[R,idx2] = 1
    m2 = m2[:,:-1]

Approach #2 Similar to Approach #1, but the initialization part does most of the job to setup unqiue random numbers per row. An additional while iterative part takes care of the rows that could not assign unique ones. With N >> n, we will hardly need to iterate though. The implementation would look something like this -

# https://stackoverflow.com/a/51915131/ @Divakar
def random_num_per_grp(L):
    # For each element in L pick a random number within range specified by it
    r1 = np.random.rand(np.sum(L)) + np.repeat(np.arange(len(L)),L)
    offset = np.r_[0,np.cumsum(L[:-1])]
    return r1.argsort()[offset] - offset

R = np.arange(M)
mask = np.ones((M,N), dtype=bool)
idx = np.random.randint(0,N,(M,n))
mask[R[:,None],idx] = 0

rows_notdone = mask.sum(1)!=N-n
while np.any(rows_notdone):    
    idx0 = random_num_per_grp(mask[rows_notdone].sum(1))
    steps = np.r_[0,mask.sum(1).cumsum()[:-1]]
    flat_idx0 = steps[rows_notdone] + idx0
    
    m2 = np.ones(mask.sum(), dtype=bool)
    m2[flat_idx0] = 0
    mask[mask] = m2
    
    rows_notdone = mask.sum(1)!=N-n

out = np.nonzero(~mask)[1].reshape(-1,n)
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Thanks. I am marking this as the answer since it reduces the cost significantly. However, the cost scaling is still ‘MxN’ since the mask array has that size and around ‘M x (N-n)^2’ random variables are generated making it better used when ‘n’ is close to ‘N’. – Tohiko Jul 02 '20 at 06:52
  • @Tohiko For the mask it's actually `MxN/8`, since its a boolean array, so has a much lesser memory footprint. How did you come up with ‘M x (N-n)^2’ random variables being used? As I said, for approach #2, you hardly need to go inside while loop. So, essentially apart from the mask, you have `np.random.randint(0,N,(M,n))` and `np.nonzero` part that generates again `Mxn` array. So, a total of mask + `Mx2*n`. That's all, essentially for the memory footprint. – Divakar Jul 02 '20 at 07:06
  • Ah, sorry. I was referring to Approach #1. As far as I can see, Approach #2 has a similar average case complexity (worst case being infinite). – Tohiko Jul 02 '20 at 09:55