I've found plenty of examples of how to randomly select DataFrame rows in pandas (with and without numpy). What I haven't found is how to add a column to a DataFrame that indicates whether a row was randomly selected. Specifically, I need to:
1) group rows by values in column A
2) randomly select 10 rows in each group without replacement
3) add a column B to indicate whether each row was selected (TRUE/FALSE).
The result should be the original DataFrame (i.e., ungrouped) with an added column of TRUE/FALSE for every row (meaning, within its group, the row was selected during random selection).
I'm using python 3.6.2, pandas 0.20.3, numpy 1.13.1.
Edit in response to comments:
For this small sample of data, let's say we randomly select 2 rows without replacement per ImageType group. Yes, the sample does not contain at least 2 rows of every ImageType. The dataset is kept small only to avoid a really long post.
The data looks like this (there are thousands of rows):
+-----------+---------------------+
| ImageType | FileName |
+-----------+---------------------+
| 9 | PIC_001_01_0_9.JPG |
| 9 | PIC_022_17_0_9.JPG |
| 38 | PIC_100_00_0_38.jpg |
| 9 | PIC_293_12_0_9.JPG |
| 9 | PIC_381_14_0_9.JPG |
| 33 | PIC_001_17_2_33.JPG |
| 9 | PIC_012_07_0_9.JPG |
| 28 | PIC_306_00_0_28.jpg |
| 28 | PIC_178_08_0_28.JPG |
| 26 | PIC_225_11_0_26.JPG |
| 18 | PIC_087_16_0_18.JPG |
| 9 | PIC_089_18_0_9.JPG |
| 19 | PIC_090_18_0_19.JPG |
| 9 | PIC_091_18_0_9.JPG |
| 19 | PIC_092_18_2_19.JPG |
| 23 | PIC_270_14_0_23.JPG |
| 13 | PIC_271_14_0_13.JPG |
+-----------+---------------------+
My actual code only reads the data from a .csv file, but the following recreates the sample data above:
import pandas as pd
df = pd.DataFrame({'ImageType': ['9','9','38','9','9','33','9','28','28','26',
                                 '18','9','19','9','19','23','13'],
                   'FileName': ['PIC_001_01_0_9.JPG','PIC_022_17_0_9.JPG',
                                'PIC_100_00_0_38.jpg','PIC_293_12_0_9.JPG',
                                'PIC_381_14_0_9.JPG','PIC_001_17_2_33.JPG',
                                'PIC_012_07_0_9.JPG','PIC_306_00_0_28.jpg',
                                'PIC_178_08_0_28.JPG','PIC_225_11_0_26.JPG',
                                'PIC_087_16_0_18.JPG','PIC_089_18_0_9.JPG',
                                'PIC_090_18_0_19.JPG','PIC_091_18_0_9.JPG',
                                'PIC_092_18_2_19.JPG','PIC_270_14_0_23.JPG',
                                'PIC_271_14_0_13.JPG']})
# group by ImageType
# select 2 rows randomly in each group, without replacement
# add a column to original DataFrame to indicate selected rows
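To make the three commented steps concrete, here is one possible sketch, not necessarily the best approach. It shuffles the rows, numbers them within each ImageType group, and flags the first 2 of each group. The column name Selected, the constant N_PER_GROUP, and the fixed random_state are my own assumptions for illustration. It also assumes the DataFrame index is unique, and that groups with fewer than 2 rows should simply have all of their rows flagged.

```python
import pandas as pd

df = pd.DataFrame({'ImageType': ['9','9','38','9','9','33','9','28','28','26',
                                 '18','9','19','9','19','23','13'],
                   'FileName': ['PIC_001_01_0_9.JPG','PIC_022_17_0_9.JPG',
                                'PIC_100_00_0_38.jpg','PIC_293_12_0_9.JPG',
                                'PIC_381_14_0_9.JPG','PIC_001_17_2_33.JPG',
                                'PIC_012_07_0_9.JPG','PIC_306_00_0_28.jpg',
                                'PIC_178_08_0_28.JPG','PIC_225_11_0_26.JPG',
                                'PIC_087_16_0_18.JPG','PIC_089_18_0_9.JPG',
                                'PIC_090_18_0_19.JPG','PIC_091_18_0_9.JPG',
                                'PIC_092_18_2_19.JPG','PIC_270_14_0_23.JPG',
                                'PIC_271_14_0_13.JPG']})

N_PER_GROUP = 2  # assumed sample size per group (10 in the real data)

# Shuffle all rows; the original index labels travel with each row.
# random_state is fixed here only to make the example repeatable.
shuffled = df.sample(frac=1, random_state=42)

# Number the shuffled rows 0, 1, 2, ... within each ImageType group.
position_in_group = shuffled.groupby('ImageType').cumcount()

# The first N_PER_GROUP rows of each shuffled group form a random sample
# without replacement. Assignment aligns on the index, so df keeps its
# original row order and only gains the new boolean column.
df['Selected'] = position_in_group < N_PER_GROUP

print(df)
```

Because the flag is computed on a shuffled copy and assigned back by index, the original DataFrame stays ungrouped and in its original order. Dropping random_state would give a different selection on each run.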