How can I choose a random sample of size n from values from a single pandas dataframe column, with repeating values occurring a maximum of 2 times?

Question

My dataframe looks like this:

Identifier       Strain     Other columns, etc.
1                  A
2                  C
3                  D
4                  B
5                  A
6                  C
7                  C
8                  B
9                  D
10                 A
11                 D
12                 D

I want to choose n rows at random while maintaining diversity in the strain values. For example, I want a group of 6, so I'd expect my final rows to include at least one of every type of strain with two strains appearing twice.

I've tried converting the Strain column into a NumPy array and using np.random.choice, but that didn't run. I've also tried .sample, but it does not maximize strain diversity.

This is my latest attempt, which outputs a sample of size 7 in order (identifiers 0-7) in which the strains are all the same:

randomsample = df[df.Strain == np.random.choice(df['Strain'].unique())].reset_index(drop=True)

Please supply the expected [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) (MRE). We should be able to copy and paste a contiguous block of your code, execute that file, and reproduce your problem along with tracing output for the problem points. This lets us test our suggestions against your test data and desired output. Please [include a minimal data frame](https://stackoverflow.com/questions/52413246/how-to-provide-a-reproducible-copy-of-your-dataframe-with-to-clipboard) as part of your MRE. — Prune, Mar 19 '21 at 22:36
We expect you to perform basic diagnosis to include with your post. At the very least, print the suspected, intermediate values at the point of error and trace them back to their sources. — Prune, Mar 19 '21 at 22:37
"didn't seem to run" is not an adequate description of a problem. Nor is your description of "sample". In any case your requirements may be hard to achieve, at least not without some cleverness. Random sampling with repeats or without is easy to specify. Random of everything but with a couple of repeats may require two samplings or some such trick. — hpaulj, Mar 19 '21 at 23:21
@hpaulj to be fair, I do think that OP's requirement is pretty clear **For example, I want a group of 6, so I'd expect my final rows to include at least one of every type of strain with two strains appearing twice.** i.e. trying to get the most even distribution possible of `Strain` in `n` samples. — Quang Hoang, Mar 20 '21 at 01:53

score 2 · Accepted Answer · answered Mar 20 '21 at 01:51

I believe there's something in NumPy that does exactly this, but I can't recall what. Here's a fairly fast approach:

1. Shuffle the data for randomness.
2. Enumerate the rows within each group.
3. Sort by the enumeration above.
4. Slice the top n rows.

So in code:

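import numpy as np

n = 6

df = df.sample(frac=1)                                 # step 1: shuffle the rows
enums = df.groupby('Strain').cumcount()                # step 2: 0, 1, 2, ... within each strain

orders = np.argsort(enums.to_numpy(), kind='stable')   # step 3: row positions ordered by that counter
samples = df.iloc[orders[:n]]                          # step 4: keep the first n positions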

Output (one possible draw, since the shuffle is random):

   Identifier Strain  Other columns, etc.
2           3      D                  NaN
7           8      B                  NaN
0           1      A                  NaN
5           6      C                  NaN
4           5      A                  NaN
8           9      D                  NaN
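
For anyone who wants to run this end to end, here is a minimal sketch of the four steps above. It rebuilds the question's frame from the posted table (only the Identifier and Strain columns), and wraps the steps in a helper called `diverse_sample`. That helper name, its `column` and `seed` parameters, and the final count check are my own additions for illustration, not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the table in the question (assumption: only these two columns matter).
df = pd.DataFrame({
    "Identifier": range(1, 13),
    "Strain": list("ACDBACCBDADD"),
})

def diverse_sample(frame, n, column="Strain", seed=None):
    """Hypothetical helper: pick n rows, spread as evenly as possible over `column`."""
    shuffled = frame.sample(frac=1, random_state=seed)      # step 1: shuffle the rows
    rank = shuffled.groupby(column).cumcount().to_numpy()   # step 2: 0, 1, 2, ... within each group
    order = np.argsort(rank, kind="stable")                 # step 3: all rank-0 rows first, then rank-1, ...
    return shuffled.iloc[order[:n]]                         # step 4: keep the first n rows

sample = diverse_sample(df, 6, seed=0)
counts = sample["Strain"].value_counts()
print(sample)
print(counts)
```

With this data every strain has at least two rows, so any n up to 8 keeps each strain at two rows or fewer. For larger n the spread becomes uneven, because strain B runs out of rows first.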
