16

So I am wondering if there's a more efficient solution in generating a 2-D array using np.random.choice where each row has unique values.

For example, for an array with shape (3,4), we expect an output of:

# Expected output given a shape (3,4)
array([[0, 1, 3, 2],
       [2, 3, 1, 0],
       [1, 3, 2, 0]])

This means that the values for each row must be unique with respect to the number of columns. So for each row in out, the integers should only fall between 0 to 3.

I know that I can achieve it by passing False to the replace argument. But I can only do it for each row and not for the whole matrix. For instance, I can do this:

>>> np.random.choice(4, size=(1,4), replace=False)
array([[0,2,3,1]])

But when I try to do this:

>>> np.random.choice(4, size=(3,4), replace=False)

I get an error like this:

 File "<stdin>", line 1, in <module>
 File "mtrand.pyx", line 1150, in mtrand.RandomState.choice 
 (numpy\random\mtrand\mtrand.c:18113)
 ValueError: Cannot take a larger sample than population when 
 'replace=False'

I assume it's because it's trying to draw 3 x 4 = 12 samples due to the size of the matrix without replacement but I'm only giving it a limit of 4.

I know that I can solve it by using a for-loop:

 >>> a = (np.random.choice(4,size=4,replace=False) for _ in range(3))
 >>> np.vstack(a)
 array([[3, 1, 2, 0],
        [1, 2, 0, 3],
        [2, 0, 3, 1]])

But I wanted to know if there's a workaround without using any for-loops? (I'm kinda assuming that adding for-loops might make it slower if I have a number of rows greater than 1000. But as you can see I am actually creating a generator in a so I'm also not sure if it has an effect after all.)

Lj Miranda
  • 368
  • 2
  • 10

3 Answers3

27

One trick I have used often is generating a random array and using argsort to get unique indices as the required unique numbers. Thus, we could do -

def random_choice_noreplace(m,n, axis=-1):
    # m, n are the number of rows, cols of output
    return np.random.rand(m,n).argsort(axis=axis)

Sample runs -

In [98]: random_choice_noreplace(3,7)
Out[98]: 
array([[0, 4, 3, 2, 6, 5, 1],
       [5, 1, 4, 6, 0, 2, 3],
       [6, 1, 0, 4, 5, 3, 2]])

In [99]: random_choice_noreplace(5,7, axis=0) # unique nums along cols
Out[99]: 
array([[0, 2, 4, 4, 1, 0, 2],
       [1, 4, 3, 2, 4, 1, 3],
       [3, 1, 1, 3, 2, 3, 0],
       [2, 3, 0, 0, 0, 2, 4],
       [4, 0, 2, 1, 3, 4, 1]])

Runtime test -

# Original approach
def loopy_app(m,n):
    a = (np.random.choice(n,size=n,replace=False) for _ in range(m))
    return np.vstack(a)

Timings -

In [108]: %timeit loopy_app(1000,100)
10 loops, best of 3: 20.6 ms per loop

In [109]: %timeit random_choice_noreplace(1000,100)
100 loops, best of 3: 3.66 ms per loop
Divakar
  • 218,885
  • 19
  • 262
  • 358
  • Brilliant! Thank you so much! – Lj Miranda Aug 01 '17 at 13:10
  • Hang on, maybe this isn't genius, unless I'm missing something... if I'm sampling `x` without replacement, then that implies I am taking a sample size that is ` – pretzlstyle Oct 25 '17 at 21:59
  • @jphollowed This basically generates unique numbers covering the entire length. As such, a good use-case would be to shuffle an array. Now, if you want to take no. of samples < length of array, simply slice `random_choice_noreplace(1000,100)[:,:20]`, where `100` is the length of the array and you want 20 samples. – Divakar Oct 26 '17 at 03:19
3

Here is my solution to repeated sampling without replacement, modified based on Divakar's answer. In his comment section, he suggested slicing the result if no. of samples < length of array. However, this may not be the most efficient method if length of array is large but no. of samples is small, because argsort can take a long time. I suggest using argpartition instead.

def random_choice_noreplace2(l, n_sample, num_draw):
    '''
    l: 1-D array or list
    n_sample: sample size for each draw
    num_draw: number of draws

    Intuition: Randomly generate numbers, get the index of the smallest n_sample number for each row.
    '''
    l = np.array(l)
    return l[np.argpartition(np.random.rand(num_draw,len(l)), n_sample-1,axis=-1)[:,:n_sample]]

Timings -

def loopy_app(l, n_sample, num_draw):
    l = np.array(l)
    a = [np.random.choice(l,size=n_sample,replace=False) for _ in range(num_draw)]
    return np.vstack(a)

def random_choice_noreplace(l, n_sample, num_draw):
    # m, n are the number of rows, cols of output
    l = np.array(l)
    return l[np.random.rand(num_draw,len(l)).argsort(axis=-1)[:,:n_sample]]

In [2]: %timeit loopy_app(range(100),2,1000)   
48.5 ms ± 2.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit random_choice_noreplace(range(100),2,1000)   
7.8 ms ± 210 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit random_choice_noreplace2(range(100),2,1000)   
2.71 ms ± 57.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Correctness -

In [5]: np.random.seed(42)      
   ...: random_choice_noreplace(range(100),2,10)                                                                                                          
Out[5]: 
array([[72, 10],
       [28, 71],
       [ 8,  5],
       [32, 71],
       [ 7, 56],
       [63, 15],
       [40, 28],
       [94, 64],
       [21, 98],
       [45, 36]])

In [6]: np.random.seed(42)      
   ...: random_choice_noreplace2(range(100),2,10)                                                                                                          
Out[6]: 
array([[72, 10],
       [28, 71],
       [ 8,  5],
       [32, 71],
       [ 7, 56],
       [63, 15],
       [40, 28],
       [94, 64],
       [21, 98],
       [45, 36]])
Lala La
  • 1,352
  • 9
  • 18
  • Yup, for the specific case of drawing a smaller set, this makes sense. It's covered here - https://stackoverflow.com/questions/45881540/, https://stackoverflow.com/questions/35572381/, among others. – Divakar Dec 13 '19 at 19:31
  • Thanks! I have been looking for solutions to this problem. This seems to be a useful feature to add to numpy.random. Do you think it worth a PR? – Lala La Dec 13 '19 at 23:01
  • 1
    Well generating the random numbers is a work-around, not exactly a direct solution. So, don't think the devs will be very excited about this one. – Divakar Dec 14 '19 at 04:40
1

In addition to Divakar's nice answer, here is another alternative that is even faster by roughly 20% on my machine:

def random_choice_noreplace_2(m, n):
    data = np.arange(m * n).reshape(n, m) % m
    for row in data: np.random.shuffle(row)
    return data

Timings:

In [3]: %timeit random_choice_noreplace(1000, 100)
3.85 ms ± 1.52 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit random_choice_noreplace_2(1000, 100)
3.1 ms ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
piripiri
  • 1,925
  • 2
  • 18
  • 35