My dataframe looks like this:

Identifier       Strain     Other columns, etc.
1                  A
2                  C
3                  D
4                  B
5                  A
6                  C
7                  C
8                  B
9                  D
10                 A
11                 D
12                 D

I want to choose n rows at random while maintaining diversity in the strain values. For example, I want a group of 6, so I'd expect my final rows to include at least one of every type of strain with two strains appearing twice.

I've tried converting the Strain column into a numpy array and using the method random.choice but that didn't seem to run. I've also tried using .sample but it does not maximize strain diversity.

This is my latest attempt, which outputs a sample of size 7 in order (identifiers 0-7) in which the strains are all the same.

randomsample = df[df.Strain == np.random.choice(df['Strain'].unique())].reset_index(drop=True)
  • Please supply the expected [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) (MRE). We should be able to copy and paste a contiguous block of your code, execute that file, and reproduce your problem along with tracing output for the problem points. This lets us test our suggestions against your test data and desired output. Please [include a minimal data frame](https://stackoverflow.com/questions/52413246/how-to-provide-a-reproducible-copy-of-your-dataframe-with-to-clipboard) as part of your MRE. – Prune Mar 19 '21 at 22:36
  • We expect you to perform basic diagnosis to include with your post. At the very least, print the suspected, intermediate values at the point of error and trace them back to their sources. – Prune Mar 19 '21 at 22:37
  • "didn't seem to run" is not an adequate description of a problem. Nor is your description of "sample". In any case your requirements may be hard to achieve, at least not without some cleverness. Random sampling with repeats or without is easy to specify. Random of everything but with a couple of repeats may require two samplings or some such trick. – hpaulj Mar 19 '21 at 23:21
  • @hpaulj to be fair, I do think that OP's requirement is pretty clear **For example, I want a group of 6, so I'd expect my final rows to include at least one of every type of strain with two strains appearing twice.** i.e. trying to get the most even distribution possible of `Strain` in `n` samples. – Quang Hoang Mar 20 '21 at 01:53

1 Answer

I believe there's something in numpy that does exactly this, but I can't recall what it is. Here's a fairly fast approach:

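  1. Shuffle the rows so that the selection is random.
  2. Number the rows within each Strain group (0, 1, 2, ...).
  3. Sort by that within-group number, so every strain's first row comes before any strain's second row.
  4. Slice the top n rows.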

So in code:

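import numpy as np

n = 6

df = df.sample(frac=1)                                # step 1: shuffle the rows
enums = df.groupby('Strain').cumcount()               # step 2: 0, 1, 2, ... within each strain

# step 3: positions that sort rows by within-strain count (stable keeps the shuffled order on ties)
orders = np.argsort(enums.to_numpy(), kind='stable')
samples = df.iloc[orders[:n]]                         # step 4: first n rows in that order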

Output:

   Identifier Strain  Other columns, etc.
2           3      D                  NaN
7           8      B                  NaN
0           1      A                  NaN
5           6      C                  NaN
4           5      A                  NaN
8           9      D                  NaN
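
For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same four steps wrapped in a function. The function name `diverse_sample`, its `seed` parameter, and the rebuilt frame are assumptions made for illustration; the frame only reproduces the Identifier and Strain columns shown in the question.

```python
import numpy as np
import pandas as pd


def diverse_sample(df, column, n, seed=None):
    """Pick n rows at random, spread as evenly as possible over `column`.

    Hypothetical helper: same shuffle / cumcount / argsort / slice steps as above.
    """
    shuffled = df.sample(frac=1, random_state=seed)        # random row order
    rank = shuffled.groupby(column).cumcount().to_numpy()  # 0, 1, 2, ... within each group
    order = np.argsort(rank, kind="stable")                # all rank-0 rows first, then rank-1, ...
    return shuffled.iloc[order[:n]]


# Rebuild the Identifier/Strain columns from the question.
df = pd.DataFrame({
    "Identifier": range(1, 13),
    "Strain": list("ACDBACCBDADD"),
})

picked = diverse_sample(df, "Strain", n=6, seed=0)
print(picked)
print(picked["Strain"].value_counts())
```

With four strains and n = 6, every strain appears at least once and two of them appear twice. Which strains get the extra rows depends on the shuffle, and this sketch does not try to make that choice uniform across strains.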
Quang Hoang