I've seen everywhere how to randomly select DataFrame rows in pandas (with and without numpy). What I haven't found is how to add a column to a DataFrame that indicates whether a row was randomly selected. Specifically, I need to

1) group rows by values in column A

2) randomly select 10 rows in each group without replacement

3) add a column B to indicate whether each row was selected (TRUE/FALSE).

The result should be the original DataFrame (i.e., ungrouped) with an added column of TRUE/FALSE for every row (meaning, within its group, the row was selected during random selection).

I'm using Python 3.6.2, pandas 0.20.3, numpy 1.13.1.

Edit in response to comments:

For this small sample of data, let's say we randomly select 2 rows without replacement per ImageType group. Yes, the sample does not contain at least 2 rows of every ImageType; the dataset is kept small to avoid making a really long post.

The data looks like this (there are thousands of rows):

+-----------+---------------------+
| ImageType |      FileName       |
+-----------+---------------------+
|         9 | PIC_001_01_0_9.JPG  |
|         9 | PIC_022_17_0_9.JPG  |
|        38 | PIC_100_00_0_38.jpg |
|         9 | PIC_293_12_0_9.JPG  |
|         9 | PIC_381_14_0_9.JPG  |
|        33 | PIC_001_17_2_33.JPG |
|         9 | PIC_012_07_0_9.JPG  |
|        28 | PIC_306_00_0_28.jpg |
|        28 | PIC_178_08_0_28.JPG |
|        26 | PIC_225_11_0_26.JPG |
|        18 | PIC_087_16_0_18.JPG |
|         9 | PIC_089_18_0_9.JPG  |
|        19 | PIC_090_18_0_19.JPG |
|         9 | PIC_091_18_0_9.JPG  |
|        19 | PIC_092_18_2_19.JPG |
|        23 | PIC_270_14_0_23.JPG |
|        13 | PIC_271_14_0_13.JPG |
+-----------+---------------------+

The code is only a read from .csv, but to recreate the sample data above:

import pandas as pd
df = pd.DataFrame({'ImageType': ['9','9','38','9','9','33','9','28','28','26',
                                 '18','9','19','9','19','23','13'],
                   'FileName': ['PIC_001_01_0_9.JPG','PIC_022_17_0_9.JPG',
                                'PIC_100_00_0_38.jpg','PIC_293_12_0_9.JPG',
                                'PIC_381_14_0_9.JPG','PIC_001_17_2_33.JPG',
                                'PIC_012_07_0_9.JPG','PIC_306_00_0_28.jpg',
                                'PIC_178_08_0_28.JPG','PIC_225_11_0_26.JPG',
                                'PIC_087_16_0_18.JPG','PIC_089_18_0_9.JPG',
                                'PIC_090_18_0_19.JPG','PIC_091_18_0_9.JPG',
                                'PIC_092_18_2_19.JPG','PIC_270_14_0_23.JPG',
                                'PIC_271_14_0_13.JPG']})
# group by ImageType
# select 2 rows randomly in each group, without replacement
# add a column to original DataFrame to indicate selected rows
Stephen Frost
  • Can you create some sample data and expected output? – Scott Boston Feb 05 '18 at 22:04
  • Will provide asap. For now, let me say every column/series dtype is object (text, in this case), and column A for grouping has approximately 10 distinct values. – Stephen Frost Feb 05 '18 at 22:10
  • What you're asking for is probably not too difficult, but it's just much easier for people to answer your question if they have some code to cut-and-paste to recreate your situation. More: [mcve] and [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – pault Feb 05 '18 at 22:15
  • @pault Thanks for the advice. Sample data and code added. – Stephen Frost Feb 06 '18 at 15:58
  • You may also find [this post](https://stackoverflow.com/questions/22472213/python-random-selection-per-group) helpful. – pault Feb 06 '18 at 16:43

1 Answer

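def get_sample(group, n=2):
    # Work on a copy so the group handed in by groupby is not mutated.
    group = group.copy()
    if len(group) <= n:
        # Groups with n or fewer rows are selected in full.
        group['Sampled'] = True
    else:
        # sample() draws without replacement by default.
        sampled = group.sample(n=n)
        group['Sampled'] = group.index.isin(sampled.index)
    return group

# group_keys=False keeps the original row index rather than adding
# ImageType as an extra index level (matters on newer pandas versions).
grouped = df.groupby('ImageType', group_keys=False)
new_df = grouped.apply(get_sample)

print(new_df)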

               FileName ImageType  Sampled
0    PIC_001_01_0_9.JPG         9    False
1    PIC_022_17_0_9.JPG         9    False
2   PIC_100_00_0_38.jpg        38     True
3    PIC_293_12_0_9.JPG         9     True
4    PIC_381_14_0_9.JPG         9    False
5   PIC_001_17_2_33.JPG        33     True
6    PIC_012_07_0_9.JPG         9    False
7   PIC_306_00_0_28.jpg        28     True
8   PIC_178_08_0_28.JPG        28     True
9   PIC_225_11_0_26.JPG        26     True
10  PIC_087_16_0_18.JPG        18     True
11   PIC_089_18_0_9.JPG         9     True
12  PIC_090_18_0_19.JPG        19     True
13   PIC_091_18_0_9.JPG         9    False
14  PIC_092_18_2_19.JPG        19     True
15  PIC_270_14_0_23.JPG        23     True
16  PIC_271_14_0_13.JPG        13     True

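If a group has no more rows than the requested sample size, all of its rows are marked as sampled. Because the selection is random, the True/False values within larger groups will differ from run to run.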
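The same flag can also be built without a per-group function. The following is a minimal sketch rather than a definitive implementation: it shuffles the whole frame once, numbers the rows within each group in shuffled order, and flags the first n of each group. The helper name `mark_random_sample`, its `seed` and `flag_col` parameters, and the small `demo` frame are assumptions made for illustration, and the approach assumes the DataFrame has a unique index.

```python
import pandas as pd


def mark_random_sample(df, group_col, n, seed=None, flag_col="Sampled"):
    """Return a copy of df with a boolean column flagging up to n random rows per group.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    # Shuffle all rows once; random_state makes the draw repeatable when seed is set.
    shuffled = df.sample(frac=1, random_state=seed)
    # Position of each row within its group, in shuffled order (0, 1, 2, ...).
    position = shuffled.groupby(group_col).cumcount()
    out = df.copy()
    # Assignment aligns on the index, so each flag lands on its original row.
    out[flag_col] = position < n
    return out


# Illustrative data only, shaped like the question's sample.
demo = pd.DataFrame({
    "ImageType": ["9", "9", "38", "9", "28", "28", "9"],
    "FileName": ["a.JPG", "b.JPG", "c.jpg", "d.JPG", "e.jpg", "f.JPG", "g.JPG"],
})
result = mark_random_sample(demo, "ImageType", n=2, seed=0)
print(result)
```

Groups smaller than n are flagged in full, which matches the behaviour of get_sample above, and the original row order is preserved because the flags are aligned back by index. Passing a seed is optional; leaving it as None gives a different selection on each run.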

noslenkwah