
I have a dataframe df of patients (identified by subject_id), including their gender and age.

I would like to draw a random sample of size n from this dataframe, with the following characteristics:

  • 50% male, 50% female
  • Median age of 40 years

Any idea how I could accomplish that using Python? Thank you!

Dominic I.
    Does this answer your question? [Random Sample of a subset of a dataframe in Pandas](https://stackoverflow.com/questions/38085547/random-sample-of-a-subset-of-a-dataframe-in-pandas) – Yoshikage Kira Dec 01 '21 at 08:54

1 Answer

I think what you want is a little bit more complex than what DataFrame.sample provides out of the box. A random sample satisfying each of your conditions could be generated (respectively) like this:

  1. Filter for women only, and randomly sample n/2, then do the same for men, and then pool them
  2. Filter for under 40s, randomly sample n/2, then do the same for over-40s and then combine them. (Though note that this does not guarantee a median of exactly 40.)

If you want to combine the two constraints, you might need to sample four times, once per gender and age group (women under 40, men under 40, women over 40, men over 40), and then pool the results. But this is the general idea.

Code for sampling would look like:

# sample() only accepts an integer n, so use floor division rather than n/2
df.loc[df.age < 40, 'subject_id'].sample(n // 2)
df.loc[df.gender == 'F', 'subject_id'].sample(n // 2)
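
To combine both constraints, a minimal sketch might look like the following. It assumes the columns are named gender (coded 'F'/'M') and age, which may not match your data; the function name stratified_sample, the age_cut parameter, and the synthetic demo frame are made up for illustration. Rows with age exactly 40 are left out of both age bands here, which is a simplification.

```python
import numpy as np
import pandas as pd


def stratified_sample(df, n, age_cut=40, seed=None):
    """Draw n // 4 rows from each gender x age-band group and pool them.

    Hypothetical helper: assumes columns 'gender' ('F'/'M') and 'age'.
    """
    per_group = n // 4
    under = df["age"] < age_cut
    over = df["age"] > age_cut
    parts = []
    for gender in ("F", "M"):
        for band in (under, over):
            group = df[(df["gender"] == gender) & band]
            # sample() raises ValueError if the group has fewer than per_group rows
            parts.append(group.sample(n=per_group, random_state=seed))
    return pd.concat(parts)


# Synthetic demo data, only to show the call
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "subject_id": np.arange(1000),
    "gender": rng.choice(["F", "M"], size=1000),
    "age": rng.integers(18, 90, size=1000),
})
sample = stratified_sample(demo, n=100, seed=1)
print(sample["gender"].value_counts())
print(sample["age"].median())
```

This gives an exact 50/50 gender split and puts half of the sample on each side of 40, so the median falls between the oldest sampled under-40 and the youngest sampled over-40 rather than at exactly 40.
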
Josh Friedlander