I think what you want is a little bit more complex than what DataFrame.sample provides out of the box.
A random sample satisfying each of your conditions could be generated (respectively) like this:
- Filter for women only and randomly sample n/2, then do the same for men, and then pool them
- Filter for under 40s and randomly sample n/2, then do the same for those aged 40 and over, and then combine them. (Though note that this does not guarantee a median of exactly 40.)
If you want to combine the two constraints, you would need to sample 4 times, taking n/4 from each combination: women under 40, men under 40, women 40 and over, men 40 and over. But this is the general idea.
Code for sampling would look like:
# sample() needs an integer size, so use n // 2 rather than n / 2
df.loc[df.age < 40, 'subject_id'].sample(n // 2)
df.loc[df.gender == 'F', 'subject_id'].sample(n // 2)
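To put the pieces together, here is a rough sketch of the combined case, drawing n // 4 subjects from each gender/age cell and pooling them with pd.concat. The data is made up for illustration, and the helper name balanced_sample, the 'M' gender code, and the seed parameter are my own assumptions rather than anything from your question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names follow the snippets above, values are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "subject_id": np.arange(1000),
    "gender": rng.choice(["F", "M"], size=1000),  # assumes men are coded 'M'
    "age": rng.integers(18, 80, size=1000),
})

def balanced_sample(df, n, seed=None):
    """Draw n // 4 rows from each gender x age-group cell and pool them."""
    per_cell = n // 4
    cells = [
        (df.gender == "F") & (df.age < 40),
        (df.gender == "F") & (df.age >= 40),
        (df.gender == "M") & (df.age < 40),
        (df.gender == "M") & (df.age >= 40),
    ]
    # sample() raises ValueError if a cell holds fewer than per_cell rows
    parts = [df.loc[mask].sample(n=per_cell, random_state=seed) for mask in cells]
    return pd.concat(parts)

sample = balanced_sample(df, n=100, seed=42)
```

The result is half women and half men, and half under 40 and half 40 or over. As noted above, that split does not by itself guarantee a median age of exactly 40.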