Python random sample from dataframe with given characteristics

Question

I have a dataframe df of patients, identified by subject_id, along with their gender and age.

I would like to draw a random sample of size n from this dataframe, with the following characteristics:

50% male, 50% female
Median age of 40 years

Any idea how I could accomplish that using Python? Thank you!

Does this answer your question? [Random Sample of a subset of a dataframe in Pandas](https://stackoverflow.com/questions/38085547/random-sample-of-a-subset-of-a-dataframe-in-pandas) — Yoshikage Kira, Dec 01 '21 at 08:54

score 1 · Accepted Answer · answered Dec 01 '21 at 09:04

I think what you want is a little bit more complex than what DataFrame.sample provides out of the box. A random sample satisfying each of your conditions could be generated (respectively) like this:

Filter for women only, randomly sample n/2 of them, do the same for men, and then pool the two samples.
Filter for under-40s, randomly sample n/2 of them, do the same for those aged 40 and over, and then combine them. (Note that this does not guarantee a median of exactly 40, only that half the sample falls on each side of the cutoff.)

If you want to combine the two constraints, you might need to sample 4 times - women under 40, men under 40, etc. But this is the general idea.

Code for sampling would look like:

# Use integer division: sample() expects an integer number of rows
df.loc[df.age < 40, 'subject_id'].sample(n // 2)
df.loc[df.gender == 'F', 'subject_id'].sample(n // 2)
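
To combine both constraints with the four-group approach, a minimal sketch might look like the following. It assumes the columns are named subject_id, gender and age, that gender is coded as 'F'/'M', and that n is divisible by 4. The helper name sample_balanced and the synthetic data are made up for illustration.

```python
import numpy as np
import pandas as pd


def sample_balanced(df, n, age_cut=40, seed=None):
    """Draw n rows: n/4 from each (gender, age group) combination.

    Hypothetical helper; assumes 'gender' holds 'F'/'M' and 'age' is numeric.
    """
    if n % 4 != 0:
        raise ValueError("n must be divisible by 4 to get four equal groups")
    k = n // 4
    younger = df["age"] < age_cut  # the complement is age >= age_cut
    parts = []
    for gender in ("F", "M"):
        for age_mask in (younger, ~younger):
            group = df[(df["gender"] == gender) & age_mask]
            # sample() raises ValueError if the group has fewer than k rows
            parts.append(group.sample(n=k, random_state=seed))
    return pd.concat(parts)


# Synthetic data, for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "subject_id": np.arange(1000),
    "gender": rng.choice(["F", "M"], size=1000),
    "age": rng.integers(18, 90, size=1000),
})

sample = sample_balanced(df, n=100, seed=42)
print(sample["gender"].value_counts())
print(sample["age"].median())
```

Because half of the sampled rows are under 40 and half are 40 or over, the sample median falls between the oldest under-40 patient and the youngest 40-or-over patient drawn. It will usually be close to 40, but it is not guaranteed to be exactly 40.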
