In pandas DataFrame, how to add column showing random selection result?

Question

I've seen everywhere how to randomly select DataFrame rows in pandas (with and without numpy). What I haven't found is how to add a column to a DataFrame that indicates whether a row was randomly selected. Specifically, I need to

1) group rows by values in column A

2) randomly select 10 rows in each group without replacement

3) add a column B to indicate whether each row was selected (TRUE/FALSE).

The result should be the original DataFrame (i.e., ungrouped) with an added TRUE/FALSE column, where TRUE means the row was picked during random selection within its group.

I'm using Python 3.6.2, pandas 0.20.3, and numpy 1.13.1.

Edit in response to comments:

For this small sample of data, let's now say randomly select 2 rows without replacement per grouping by ImageType. Yes, the data sample does not have at least 2 of every ImageType. Please understand that the small dataset is to prevent making a really long post.

The data looks like this (there are thousands of rows):

+-----------+---------------------+
| ImageType |      FileName       |
+-----------+---------------------+
|         9 | PIC_001_01_0_9.JPG  |
|         9 | PIC_022_17_0_9.JPG  |
|        38 | PIC_100_00_0_38.jpg |
|         9 | PIC_293_12_0_9.JPG  |
|         9 | PIC_381_14_0_9.JPG  |
|        33 | PIC_001_17_2_33.JPG |
|         9 | PIC_012_07_0_9.JPG  |
|        28 | PIC_306_00_0_28.jpg |
|        28 | PIC_178_08_0_28.JPG |
|        26 | PIC_225_11_0_26.JPG |
|        18 | PIC_087_16_0_18.JPG |
|         9 | PIC_089_18_0_9.JPG  |
|        19 | PIC_090_18_0_19.JPG |
|         9 | PIC_091_18_0_9.JPG  |
|        19 | PIC_092_18_2_19.JPG |
|        23 | PIC_270_14_0_23.JPG |
|        13 | PIC_271_14_0_13.JPG |
+-----------+---------------------+

My actual code only reads the data from a .csv file, but to recreate the sample data above:

import pandas as pd
df = pd.DataFrame({'ImageType': ['9','9','38','9','9','33','9','28','28','26',
                                 '18','9','19','9','19','23','13'],
                   'FileName': ['PIC_001_01_0_9.JPG','PIC_022_17_0_9.JPG',
                                'PIC_100_00_0_38.jpg','PIC_293_12_0_9.JPG',
                                'PIC_381_14_0_9.JPG','PIC_001_17_2_33.JPG',
                                'PIC_012_07_0_9.JPG','PIC_306_00_0_28.jpg',
                                'PIC_178_08_0_28.JPG','PIC_225_11_0_26.JPG',
                                'PIC_087_16_0_18.JPG','PIC_089_18_0_9.JPG',
                                'PIC_090_18_0_19.JPG','PIC_091_18_0_9.JPG',
                                'PIC_092_18_2_19.JPG','PIC_270_14_0_23.JPG',
                                'PIC_271_14_0_13.JPG']})
# group by ImageType
# select 2 rows randomly in each group, without replacement
# add a column to original DataFrame to indicate selected rows

Will provide asap. For now, let me say every column/series dtype is object (text, in this case), and column A for grouping has approximately 10 distinct values. — Stephen Frost, Feb 05 '18 at 22:10
What you're asking for is probably not too difficult, but it's just much easier for people to answer your question if they have some code to cut-and-paste to recreate your situation. More: [mcve] and [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). — pault, Feb 05 '18 at 22:15
You may also find [this post](https://stackoverflow.com/questions/22472213/python-random-selection-per-group) helpful. — pault, Feb 06 '18 at 16:43

score 1 · Accepted Answer · answered Feb 06 '18 at 16:33

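def get_sample(group, n=2):
    # Work on a copy so the frame handed in by groupby is not mutated.
    group = group.copy()
    if len(group) <= n:
        # The group has n rows or fewer: mark every row as selected.
        group['Sampled'] = True
    else:
        # sample() draws without replacement by default.
        s = group.sample(n=n)
        group['Sampled'] = group.index.isin(s.index)
    return group

# group_keys=False keeps the original index rather than adding ImageType to it.
grouped = df.groupby('ImageType', group_keys=False)
new_df = grouped.apply(get_sample)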

print(new_df)

               FileName ImageType  Sampled
0    PIC_001_01_0_9.JPG         9    False
1    PIC_022_17_0_9.JPG         9    False
2   PIC_100_00_0_38.jpg        38     True
3    PIC_293_12_0_9.JPG         9     True
4    PIC_381_14_0_9.JPG         9    False
5   PIC_001_17_2_33.JPG        33     True
6    PIC_012_07_0_9.JPG         9    False
7   PIC_306_00_0_28.jpg        28     True
8   PIC_178_08_0_28.JPG        28     True
9   PIC_225_11_0_26.JPG        26     True
10  PIC_087_16_0_18.JPG        18     True
11   PIC_089_18_0_9.JPG         9     True
12  PIC_090_18_0_19.JPG        19     True
13   PIC_091_18_0_9.JPG         9    False
14  PIC_092_18_2_19.JPG        19     True
15  PIC_270_14_0_23.JPG        23     True
16  PIC_271_14_0_13.JPG        13     True

If a group has n rows or fewer, all of its rows are marked as sampled.
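
For thousands of rows, a version that avoids a Python-level function per group may be worth trying. The following is only a sketch, not part of the original answer: the helper name mark_sampled and its by, n, and seed parameters are made up for illustration. It gives every row a random key, ranks the keys within each group, and flags the rows whose rank is at most n, which amounts to sampling without replacement; groups with n rows or fewer end up fully selected.

```python
import numpy as np
import pandas as pd


def mark_sampled(df, by, n=2, seed=None):
    """Return a boolean Series aligned to df.index; True marks sampled rows.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    rng = np.random.RandomState(seed)
    # One independent uniform random key per row.
    keys = pd.Series(rng.rand(len(df)), index=df.index)
    # Rank the keys within each group; ranks 1..n are a random draw of
    # n rows without replacement. Smaller groups are selected entirely.
    ranks = keys.groupby(df[by].values).rank(method="first")
    return ranks <= n


# A shortened version of the question's sample data.
df = pd.DataFrame({
    "ImageType": ["9", "9", "38", "9", "28", "28", "19", "19", "9"],
    "FileName": ["PIC_001_01_0_9.JPG", "PIC_022_17_0_9.JPG",
                 "PIC_100_00_0_38.jpg", "PIC_293_12_0_9.JPG",
                 "PIC_306_00_0_28.jpg", "PIC_178_08_0_28.JPG",
                 "PIC_090_18_0_19.JPG", "PIC_092_18_2_19.JPG",
                 "PIC_012_07_0_9.JPG"],
})
df["Sampled"] = mark_sampled(df, "ImageType", n=2, seed=0)
print(df)
```

Passing a fixed seed makes the selection repeatable; leave it as None for a different draw on each run.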
