I have a really big list. Imagine it looks something like this:

test = ['llama', 'cow', 'horse', 'fish', 'sheep', 'goat', 'cat', 'dog']

I want to sample out of this list many times. I want each sample to be taken without replacement. I want to avoid for loops in this case.

I've seen many solutions on StackOverflow that are close, but not exactly what I need. Let's say each sample should be of size 3. If I wanted to sample with replacement, this would work:

np.random.choice(test, size=(100, 3))

This would give me 100 rows with a sample of 3 in each row. The problem is that any particular row might have repeats, and I can't ask it to sample without replacement, because 300 > len(test).
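
For reference, here's a minimal reproduction of the limitation (the exact error wording may vary between numpy versions):

import numpy as np

test = ['llama', 'cow', 'horse', 'fish', 'sheep', 'goat', 'cat', 'dog']

# Works, but individual rows can contain repeats:
with_repeats = np.random.choice(test, size=(100, 3))

# Raises ValueError: numpy treats the (100, 3) output as one draw of
# 300 items from an 8-item population, so replace=False is rejected.
# np.random.choice(test, size=(100, 3), replace=False)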

Is there a way around this that maintains randomness? I've seen potential solutions that use np.argsort, but I'm not sure they're actually random, given that sorting is involved.

John Rouhana
  • Don't you want to just encapsulate the two possibilities (sample shorter or longer than the original list) in a custom function containing an if/else statement? – V. Déhaye Dec 21 '18 at 21:58
  • I mean, that's really not the big issue for me; I'll edit it out of the question. The big issue is just that I want to take many samples from the same list, and each sample needs to be without replacement. Other languages I've used have this functionality built into their functions. I'm trying to see if I'm missing something here. – John Rouhana Dec 21 '18 at 22:00
  • So in your example you'd like each row without replacement, but you don't care if one row is the same as another one right ? – V. Déhaye Dec 21 '18 at 22:03
  • Precisely, that's what I'm looking for. – John Rouhana Dec 21 '18 at 22:04

3 Answers

You can use random.sample for that. From the documentation:

Return a k length list of unique elements chosen from the population sequence. Used for random sampling without replacement.

And repeat the process n_times using a list comprehension:

import random

n_times = 100
n_sample = 3
[random.sample(test, n_sample) for i in range(n_times)]

[['llama', 'goat', 'sheep'],
 ['cat', 'horse', 'dog'],
 ['sheep', 'dog', 'goat'],
 ['cat', 'cow', 'llama'],
 ['dog', 'fish', 'horse'],
 ['llama', 'horse', 'cow'],
 ['dog', 'goat', 'cow'],
 ['llama', 'cow', 'sheep'],
 ['fish', 'dog', 'horse'],
 ... 
yatu
  • This is basically what I have implemented right now. My only problem is that it's too slow for what I'm doing. I was wondering if there's an implementation somewhere that I'm missing that would do this faster. Upvoted, but holding out for other answers. – John Rouhana Dec 24 '18 at 15:22

Here's a vectorized approach using the rand + argsort/argpartition trick from here -

# argpartition places the indices of each row's 3 smallest random values first,
# giving 3 distinct column indices per row (i.e. no replacement within a row)
idx = np.random.rand(100, len(test)).argpartition(3, axis=1)[:, :3]
out = np.take(test, idx)  # map those indices back to the elements of test

Let's verify that all are unique per row with some pandas help -

In [51]: idx = np.random.rand(100, len(test)).argpartition(3,axis=1)[:,:3]
    ...: out = np.take(test, idx)

In [52]: import pandas as pd

In [53]: (pd.DataFrame(out).nunique(axis=1).values==3).all()
Out[53]: True
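
A quick empirical check of the randomness itself (not just per-row uniqueness): draw many rows and tally how often each element shows up. With 3 picks out of 8 per row, each animal should land in roughly 3/8 of the rows. A rough sketch along those lines:

import numpy as np

test = ['llama', 'cow', 'horse', 'fish', 'sheep', 'goat', 'cat', 'dog']

# Many rows so the empirical frequencies settle down
idx = np.random.rand(100000, len(test)).argpartition(3, axis=1)[:, :3]
out = np.take(test, idx)

# Fraction of rows containing each animal; expect values near 3/8 = 0.375
vals, counts = np.unique(out, return_counts=True)
print(dict(zip(vals, counts / len(out))))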
Divakar
  • Are samples taken this way independent of each other? I stumbled upon your response to the other thread, but I don't really follow what this method is actually doing. – John Rouhana Dec 24 '18 at 17:05
  • @JohnRouhana Yes, they are independent, because `argpartition(3,axis=1)` operates along each row independently. `argsort` or `argpartition` give unique indices per row, i.e. no replacement within a row. – Divakar Dec 24 '18 at 17:08
  • I've observed a marginal speedup compared to what I had, so I'm accepting this answer. Thanks. – John Rouhana Dec 24 '18 at 18:25

You could run np.random.choice without replacement once for each row and collect the results into a matrix:

np.array([np.random.choice(test, 3, replace=False) for i in range(100)])
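
Note that this still loops in Python (the comprehension calls np.random.choice once per row), so it's unlikely to be faster than the random.sample approach above; its main benefit is that you get a numpy array back directly.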
Atnas