I have a dataframe like this:
import pandas as pd

names = ["Patient 1", "Patient 2", "Patient 3", "Patient 4", "Patient 5", "Patient 6", "Patient 7"]
categories = ["Internal medicine, Gastroenterology", "Internal medicine, General Med, Endocrinology", "Pediatrics, Medical genetics, Laboratory medicine", "Internal medicine", "Endocrinology", "Pediatrics", "General Med, Laboratory medicine"]
zippedList = list(zip(names, categories))
df = pd.DataFrame(zippedList, columns=['names', 'categories'])
yielding:
print(df)
names categories
0 Patient 1 Internal medicine, Gastroenterology
1 Patient 2 Internal medicine, General Med, Endocrinology
2 Patient 3 Pediatrics, Medical genetics, Laboratory medicine
3 Patient 4 Internal medicine
4 Patient 5 Endocrinology
5 Patient 6 Pediatrics
6 Patient 7 General Med, Laboratory medicine
(The real dataframe has more than 1000 rows.)
Counting the categories yields:
print(df['categories'].str.split(", ").explode().value_counts())
Internal medicine 3
General Med 2
Endocrinology 2
Laboratory medicine 2
Pediatrics 2
Gastroenterology 1
Medical genetics 1
I would like to draw a random sub-sample of n rows so that each medical category is proportionally represented. For example, 3 of the 13 category mentions (~23%) are "Internal medicine", so ~23% of the category mentions in the sub-sample should be "Internal medicine" as well. This wouldn't be too hard if each patient had exactly one category, but unfortunately they can have several (e.g. Patient 3 even has 3 categories). How can I do this?
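To make the goal concrete, here is a rough sketch of how the target shares and the quality of a candidate sub-sample could be measured. The helper names `category_shares`, `deviation`, and `best_of_k` are made up for illustration, and the "draw k random samples and keep the closest" loop is only a naive baseline under my assumptions, not a proper stratified sampling method.

```python
import pandas as pd

names = [f"Patient {i}" for i in range(1, 8)]
categories = [
    "Internal medicine, Gastroenterology",
    "Internal medicine, General Med, Endocrinology",
    "Pediatrics, Medical genetics, Laboratory medicine",
    "Internal medicine",
    "Endocrinology",
    "Pediatrics",
    "General Med, Laboratory medicine",
]
df = pd.DataFrame({"names": names, "categories": categories})


def category_shares(frame):
    """Share of each category among all category mentions in `frame`."""
    return frame["categories"].str.split(", ").explode().value_counts(normalize=True)


def deviation(sample, target):
    """Largest absolute gap between sample and target shares (0 = perfect match)."""
    # Categories missing from the sample count as a share of 0.
    shares = category_shares(sample).reindex(target.index, fill_value=0.0)
    return (shares - target).abs().max()


def best_of_k(frame, n, k=200, seed=0):
    """Naive baseline (hypothetical): draw k random samples of n rows, keep the closest."""
    target = category_shares(frame)
    candidates = (frame.sample(n=n, random_state=seed + i) for i in range(k))
    return min(candidates, key=lambda s: deviation(s, target))


target = category_shares(df)      # e.g. "Internal medicine" -> 3/13
sample = best_of_k(df, n=4)
print(target.round(2))
print(sample)
print(deviation(sample, target))
```

With only 7 rows the match will be coarse; on the real dataframe the same `deviation` score could be used to compare any sampling strategy against the target shares.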