I have a really big list. Imagine it looks something like this:

test = ['llama', 'cow', 'horse', 'fish', 'sheep', 'goat', 'cat', 'dog']

I want to sample out of this list many times. I want each sample to be taken without replacement. I want to avoid for loops in this case.

I've seen many solutions on StackOverflow that are close, but not exactly what I need. Let's say each sample should be of size 3. If I wanted to sample with replacement, this would work:

np.random.choice(test, size=(100, 3))

This would give me 100 rows with a sample of 3 in each row. The problem is that any particular row might have repeats, and I can't ask it to sample without replacement, because 300 > len(test).
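
For reference, here's a minimal reproduction of the limitation (the exact error wording may vary between numpy versions):

import numpy as np

test = ['llama', 'cow', 'horse', 'fish', 'sheep', 'goat', 'cat', 'dog']

# Works, but individual rows can contain repeats:
with_repeats = np.random.choice(test, size=(100, 3))

# Raises ValueError: numpy treats the (100, 3) output as one draw of
# 300 items from an 8-item population, so replace=False is rejected.
# np.random.choice(test, size=(100, 3), replace=False)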

Is there a way around this that maintains randomness? I've seen potential solutions that use np.argsort, but I'm not sure they're actually random, given that sorting is involved.

John Rouhana
  • Don't you want to just encapsulate the two possibilities (sample shorter or longer than the original list) in a custom function containing an if/else statement? – V. Déhaye Dec 21 '18 at 21:58
  • I mean, that's really not the big issue for me; I'll edit it out of the question. The big issue is just that I want to take many samples from the same list, and each sample needs to be without replacement. Other languages I've used have this functionality built into their functions. I'm trying to see if I'm missing something here. – John Rouhana Dec 21 '18 at 22:00
  • So in your example you'd like each row without replacement, but you don't care if one row is the same as another one right ? – V. Déhaye Dec 21 '18 at 22:03
  • Precisely, that's what I'm looking for. – John Rouhana Dec 21 '18 at 22:04

3 Answers

You can use random.sample for that. From the documentation:

Return a k length list of unique elements chosen from the population sequence. Used for random sampling without replacement.

And repeat the process n_times using a list comprehension:

import random

n_times = 100
n_sample = 3
[random.sample(test, n_sample) for i in range(n_times)]

[['llama', 'goat', 'sheep'],
 ['cat', 'horse', 'dog'],
 ['sheep', 'dog', 'goat'],
 ['cat', 'cow', 'llama'],
 ['dog', 'fish', 'horse'],
 ['llama', 'horse', 'cow'],
 ['dog', 'goat', 'cow'],
 ['llama', 'cow', 'sheep'],
 ['fish', 'dog', 'horse'],
 ... 
yatu
  • This is basically what I have implemented right now. My only problem is that it's too slow for what I'm doing. I was wondering if there's an implementation somewhere that I'm missing that would do this faster. Upvoted, but holding out for other answers. – John Rouhana Dec 24 '18 at 15:22

Here's a vectorized approach using the rand + argsort/argpartition trick from here -

# argpartition places the indices of each row's 3 smallest random values first,
# giving 3 distinct column indices per row (i.e. no replacement within a row)
idx = np.random.rand(100, len(test)).argpartition(3, axis=1)[:, :3]
out = np.take(test, idx)  # map those indices back to the elements of test

Let's verify that all are unique per row with some pandas help -

In [51]: idx = np.random.rand(100, len(test)).argpartition(3,axis=1)[:,:3]
    ...: out = np.take(test, idx)

In [52]: import pandas as pd

In [53]: (pd.DataFrame(out).nunique(axis=1).values==3).all()
Out[53]: True
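
A quick empirical check of the randomness itself (not just per-row uniqueness): draw many rows and tally how often each element shows up. With 3 picks out of 8 per row, each animal should land in roughly 3/8 of the rows. A rough sketch along those lines:

import numpy as np

test = ['llama', 'cow', 'horse', 'fish', 'sheep', 'goat', 'cat', 'dog']

# Many rows so the empirical frequencies settle down
idx = np.random.rand(100000, len(test)).argpartition(3, axis=1)[:, :3]
out = np.take(test, idx)

# Fraction of rows containing each animal; expect values near 3/8 = 0.375
vals, counts = np.unique(out, return_counts=True)
print(dict(zip(vals, counts / len(out))))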
Divakar
  • Are samples taken this way independent of each other? I stumbled upon your response to the other thread, but I don't really follow what this method is actually doing. – John Rouhana Dec 24 '18 at 17:05
  • @JohnRouhana Yes, they are independent, because `argpartition(3,axis=1)` operates along each row independently. `argsort` or `argpartition` give unique indices per row, i.e. no replacement within a row. – Divakar Dec 24 '18 at 17:08
  • I've observed a marginal speedup compared to what I had, so I'm accepting this answer. Thanks. – John Rouhana Dec 24 '18 at 18:25

You could run np.random.choice without replacement once for each row and collect the results into a matrix:

np.array([np.random.choice(test, 3, replace=False) for i in range(100)])
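
Note that this still loops in Python (the comprehension calls np.random.choice once per row), so it's unlikely to be faster than the random.sample approach above; its main benefit is that you get a numpy array back directly.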
Atnas