I'm choosing k elements without replacement from a sample, n distinct times, according to a specified distribution.

The iterative solution is simple:

for _ in range(n):
    np.random.choice(a, size=k, replace=False, p=p)

I can't set size=(k, n) because that would sample without replacement across all the samples at once. a and n are large, so I'm hoping for a vectorized solution.

Eric Kaschalk
  • I don't really understand the question –  Oct 12 '16 at 15:50
  • Consider `np.random.choice(np.arange(5), size=(3, 5), replace=False)`. This gives the error 'Cannot take a larger sample than population when replace=False'. What I want is to choose 3 values from `range(5)`, 5 times, each time without replacement. – Eric Kaschalk Oct 12 '16 at 15:55
  • could you edit with the code that gives you an error ? –  Oct 12 '16 at 16:02
  • how large are `a` and `n`? are you sure this is not a case of premature optimization? – Aaron Oct 12 '16 at 16:02
  • If you have large arrays, a loop may in fact be better in order to save on memory usage – Aaron Oct 12 '16 at 16:13
  • I should have specified that in the end I will operate on the arrays all at once - so I either will join the choices into a single `(k, n)` array or generate it all at once. I asked this question to see if the latter is possible. – Eric Kaschalk Oct 12 '16 at 16:24
  • You can preallocate the final result and store each iteration in successive columns. – Mad Physicist Oct 12 '16 at 16:27
  • @Aaron, he just gave you an example of a and n values 2 comments above your first one. – Hedwin Bonnavaud Nov 21 '21 at 09:56
  • @HedwinBonnavaud this was 5 years ago... evidently I commented before fully reading the comments, but like.... clearly it no longer matters. – Aaron Nov 21 '21 at 18:43

2 Answers

So the full iterative solution is:

In [158]: ll=[]
In [159]: for _ in range(10):
     ...:     ll.append(np.random.choice(5,3)) 
In [160]: ll
Out[160]: 
[array([3, 2, 4]),
 array([1, 1, 3]),
 array([0, 3, 1]),
 ...
 array([0, 3, 0])]
In [161]: np.array(ll)
Out[161]: 
array([[3, 2, 4],
       [1, 1, 3],
       ...
       [3, 0, 1],
       [4, 4, 2],
       [0, 3, 0]])

That could be cast as a list comprehension: np.array([np.random.choice(5,3) for _ in range(10)]).

Or an equivalent where you preallocate A=np.zeros((10,3),int) and assign A[i,:]=np.random... inside the loop.

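In other words, you want choices from range(5), but want them to be unique only within rows. (The calls above omit replace=False, so their rows can repeat values, as in [1, 1, 3]; pass replace=False to make each row unique.)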

The np.random.choice docs suggest an alternative:

>>> np.random.choice(5, 3, replace=False)
array([3,1,0])
>>> #This is equivalent to np.random.permutation(np.arange(5))[:3]

I'm wondering if I can generate

array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       ...
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])

and permute values within rows. But np.random.permutation shuffles a 2-D array only along its first axis, moving whole rows at a time, so it can't permute values independently within each row. So I'm still stuck with iterating on rows to produce the choice without replacement.
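For the uniform case (no p), one way to get an independent permutation per row without a Python loop is to draw a random key for every cell and argsort each row. This is only a sketch of that idea, not something taken from the np.random.choice docs; the names n_rows, population and k are placeholders for illustration, and it assumes NumPy 1.17 or later for np.random.default_rng.

```python
import numpy as np

rng = np.random.default_rng(0)

n_rows, population, k = 10, 5, 3  # placeholder sizes for illustration

# One uniform random key per (row, item) cell.
keys = rng.random((n_rows, population))

# Sorting the keys of each row yields an independent permutation of
# range(population) in every row.
perms = np.argsort(keys, axis=1)

# The first k columns are k distinct values per row, i.e. a uniform
# sample without replacement within each row.
samples = perms[:, :k]
print(samples)
```

The cost is an (n_rows, population) array of keys, so this trades memory for the loop and may not pay off when the population is very large.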

hpaulj
  • Nevermind, found [this](https://stackoverflow.com/questions/51279464/sampling-unique-column-indexes-for-each-row-of-a-numpy-array) which you pointed out – Josmoor98 Jan 29 '20 at 15:07
  • "np.random.choice(5, 3, replace=False) is equivalent to np.random.permutation(np.arange(5))[:3]", is it also equivalent in terms of computation time ? – Hedwin Bonnavaud Nov 21 '21 at 10:00

Here are a couple of suggestions.

  1. You can preallocate the (n, k) output array, then do the choice multiple times:

    result = np.zeros((n, k), dtype=a.dtype)
    for row in range(n):
        result[row, :] = np.random.choice(a, size=k, replace=False, p=p)
    
  2. You can precompute the n * k selection indices and then apply them to a all at once. Since you want to sample the indices without replacement within each row, you will still need to call np.random.choice in a loop:

    indices = np.concatenate([np.random.choice(a.size, size=k, replace=False, p=p) for _ in range(n)])
    result = a[indices].reshape(n, k)
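If the loop itself is the bottleneck, a loop-free alternative that is sometimes used for weighted sampling without replacement is the Gumbel top-k trick: add independent Gumbel noise to log(p) and keep the k largest keys in each row. The sketch below is a swapped-in technique, not what np.random.choice does internally, and I have not checked that it reproduces NumPy's results draw for draw. The function name sample_rows_without_replacement and its rng argument are hypothetical, and it assumes NumPy 1.17 or later and that p has at least k nonzero entries.

```python
import numpy as np

def sample_rows_without_replacement(a, p, n, k, rng=None):
    """Hypothetical helper: n rows of k distinct elements of a, weighted by p."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.asarray(a)
    p = np.asarray(p, dtype=float)

    # log(0) is -inf, so zero-probability items can never be selected.
    with np.errstate(divide="ignore"):
        log_p = np.log(p)

    # One perturbed key per (row, item): log-probability plus Gumbel noise.
    keys = log_p + rng.gumbel(size=(n, a.size))

    # Indices of the k largest keys in each row (unordered within the row).
    idx = np.argpartition(-keys, k - 1, axis=1)[:, :k]
    return a[idx]

a = np.arange(10) * 10
p = np.array([0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05, 0.0, 0.0])
result = sample_rows_without_replacement(a, p, n=200, k=4,
                                         rng=np.random.default_rng(0))
print(result[:5])
```

Note that this allocates an (n, a.size) array of keys, so with large a and n the preallocated loop above may still be the more practical choice.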
    
Mad Physicist