I have a dataframe like this:

import pandas as pd

names = ["Patient 1", "Patient 2", "Patient 3", "Patient 4", "Patient 5", "Patient 6", "Patient 7"]
categories = ["Internal medicine, Gastroenterology", "Internal medicine, General Med, Endocrinology", "Pediatrics, Medical genetics, Laboratory medicine", "Internal medicine", "Endocrinology", "Pediatrics", "General Med, Laboratory medicine"]

zippedList = list(zip(names, categories))
df = pd.DataFrame(zippedList, columns=['names', 'categories'])

yielding:

print(df)
       names                                         categories
0  Patient 1                Internal medicine, Gastroenterology
1  Patient 2      Internal medicine, General Med, Endocrinology
2  Patient 3  Pediatrics, Medical genetics, Laboratory medicine
3  Patient 4                                  Internal medicine
4  Patient 5                                      Endocrinology
5  Patient 6                                         Pediatrics
6  Patient 7                   General Med, Laboratory medicine

(The real data-frame has >1000 rows)

and counting the categories yields:

print(df['categories'].str.split(", ").explode().value_counts())

Internal medicine      3
General Med            2
Endocrinology          2
Laboratory medicine    2
Pediatrics             2
Gastroenterology       1
Medical genetics       1

I would like to draw a random sub-sample of n rows so that each medical category is proportionally represented. For example, 3 of the 13 category labels (~23%) are "Internal medicine", so ~23% of the category labels in the sub-sample should be "Internal medicine" as well. This wouldn't be too hard if each patient had exactly one category, but unfortunately they can have several (e.g. Patient 3 has three). How can I do this?

lordy
  • I recommend you use `train_test_split` and pass the `stratify` argument. You can then define the size of your test set and use that in the rest of your code. See here: https://stackoverflow.com/a/36998108/5763165. – nick Nov 16 '20 at 11:05
  • `stratify` would work but it doesn't address the multi-label in one column issue ... – lordy Nov 16 '20 at 13:30
  • So it just depends if you want to provide more weight to the multi label options. You could treat those as a separate category and stratify across them as well. Other option could be to duplicate "patient" data for each category in list, then stratify, and then drop duplicates in final df at random. – nick Nov 16 '20 at 13:46
  • To truly keep the ratios that you are after (e.g., the 3/13 quoted above) you'd need to treat the different combinations as a separate category (e.g., you could just treat the list as a unique string) and then stratify across them. – nick Nov 16 '20 at 13:47
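The suggestion to treat each combination of categories as its own stratum can be sketched as follows. This is only a minimal illustration under assumptions: it swaps `train_test_split(..., stratify=...)` for pandas' `groupby(...).sample(...)` (available from pandas 1.1), and the 700-row frame, `frac` and `random_state` are made up for the demo.

```python
import pandas as pd

# Hypothetical stand-in for the real data: the 7 example rows repeated 100 times.
base_categories = [
    "Internal medicine, Gastroenterology",
    "Internal medicine, General Med, Endocrinology",
    "Pediatrics, Medical genetics, Laboratory medicine",
    "Internal medicine",
    "Endocrinology",
    "Pediatrics",
    "General Med, Laboratory medicine",
]
df = pd.DataFrame({
    "names": [f"Patient {i}" for i in range(1, 701)],
    "categories": base_categories * 100,
})

frac = 0.1  # assumed sampling fraction

# Use the full category string as the stratum and sample the same
# fraction from every stratum.
subsample = df.groupby("categories").sample(frac=frac, random_state=0)


def category_shares(frame):
    """Share of each category among all category labels in `frame`."""
    labels = frame["categories"].str.split(", ").explode()
    return labels.value_counts(normalize=True)


print(category_shares(df))
print(category_shares(subsample))
```

Because every combination is sampled at the same rate, the per-category shares follow automatically. The catch is that a combination occurring fewer than roughly `1 / frac` times may be rounded down to zero rows, so very rare combinations can still drop out.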

1 Answer

The fact that your patients have several categories doesn't affect the subsampling process. When you draw n rows uniformly at random out of len(df) rows, the subsample keeps the category weights in expectation. Any single draw can over- or under-represent a category by chance, but that deviation shrinks towards 0 as n gets larger.

Typically,

n = 200  # sample size; must not exceed len(df) unless you pass replace=True
df2 = df.sample(n).copy(deep=True)
print(df2['categories'].str.split(", ").explode().value_counts())

should work the way you want.
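To see how close a plain random sample actually gets, you can compare the category shares before and after sampling. The sketch below is only an illustration: the 1400-row frame is a synthetic stand-in built from the question's example rows, and `n=300` and `random_state=0` are arbitrary choices.

```python
import pandas as pd

# Synthetic stand-in for the real frame: the 7 example rows repeated 200 times.
base_categories = [
    "Internal medicine, Gastroenterology",
    "Internal medicine, General Med, Endocrinology",
    "Pediatrics, Medical genetics, Laboratory medicine",
    "Internal medicine",
    "Endocrinology",
    "Pediatrics",
    "General Med, Laboratory medicine",
]
df = pd.DataFrame({
    "names": [f"Patient {i}" for i in range(1, 1401)],
    "categories": base_categories * 200,
})


def category_shares(frame):
    """Share of each category among all category labels in `frame`."""
    labels = frame["categories"].str.split(", ").explode()
    return labels.value_counts(normalize=True)


sample = df.sample(n=300, random_state=0)

# Put both distributions side by side; a category missing from the
# sample shows up with a share of 0.
comparison = pd.DataFrame({
    "full": category_shares(df),
    "sample": category_shares(sample),
}).fillna(0)
comparison["abs_diff"] = (comparison["full"] - comparison["sample"]).abs()

print(comparison.sort_values("full", ascending=False))
```

If `abs_diff` is larger than you can tolerate, or a rare category has a share of 0 in the sample, plain random sampling is not enough for your data and some form of stratification is needed.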

You mention that the real data frame has more than 1000 rows. Preprocess the categories before subsampling rather than after, as rare ones could disappear from the subsample entirely.

Ludovic H
  • Sampling each category to the correct size would guarantee the proportionality is kept rather than just hoping for convergence to the true ratios. Especially if you have categories that are very rare, you are likely to lose them with this approach. – nick Nov 16 '20 at 11:07
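One possible reading of the per-category sampling described in this comment: explode to one row per (patient, category) pair, sample each category to its target size, then collapse back to unique patients. This is a rough sketch, not a definitive implementation. The 700-row frame, `frac`, `random_state` and the helper column name `category` are assumptions for the demo, and `groupby(...).sample(...)` needs pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical stand-in for the real data: the 7 example rows repeated 100 times.
base_categories = [
    "Internal medicine, Gastroenterology",
    "Internal medicine, General Med, Endocrinology",
    "Pediatrics, Medical genetics, Laboratory medicine",
    "Internal medicine",
    "Endocrinology",
    "Pediatrics",
    "General Med, Laboratory medicine",
]
df = pd.DataFrame({
    "names": [f"Patient {i}" for i in range(1, 701)],
    "categories": base_categories * 100,
})

frac = 0.2  # assumed fraction to draw from every category

# One row per (patient, category) pair; the original row index is kept,
# so a patient with three categories appears three times.
long = df.assign(category=df["categories"].str.split(", ")).explode("category")

# Draw the same fraction of rows from every single category.
picked = long.groupby("category").sample(frac=frac, random_state=0)

# A patient may have been picked through more than one category:
# keep each original row only once.
subsample = df.loc[picked.index.unique()]

print(picked["category"].value_counts())
print(len(subsample), "unique patients selected")
```

Every category is guaranteed to contribute rows, so rare ones survive. The trade-off is that a selected patient also brings along their other categories, and dropping duplicates shrinks the result, so the final shares are only approximately proportional and the number of patients is not fixed in advance.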