I am trying to split the above dataframe into train (80%), validation (10%), and test (10%) sets; however, I want each set to contain a roughly proportional share of every disease. The Finding_Labels column holds the list of diseases each image is linked to.
Each image may be linked to more than one disease, as seen in the first row, which makes this a little problematic. How can I split the data so that all three sets end up with a roughly proportional share of each disease?
An answer in PyTorch would be appreciated.
Disease names and counts:
{'Atelectasis': 391,
'Infiltration': 1181,
'No Finding': 3479,
'Emphysema': 116,
'Pneumonia': 116,
'Pleural_Thickening': 130,
'Pneumothorax': 361,
'Mass': 241,
'Nodule': 190,
'Consolidation': 299,
'Edema': 124,
'Cardiomegaly': 157,
'Effusion': 506,
'Fibrosis': 25,
'Hernia': 1}
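For reference, here is a rough sketch of the kind of split being asked about: a simplified, greedy form of what is usually called iterative stratification for multi-label data. It is not taken from any library. The names `parse_labels` and `stratified_multilabel_split` are made up for illustration, and the `|` separator inside Finding_Labels is an assumption about how the column is encoded. It uses only the standard library and returns plain index lists.

```python
import random
from collections import defaultdict


def parse_labels(cell, sep="|"):
    """Turn one Finding_Labels cell into a set of disease names.

    ASSUMPTION: labels inside a cell are separated by "|".
    """
    return {name.strip() for name in cell.split(sep) if name.strip()}


def stratified_multilabel_split(label_sets, fractions=(0.8, 0.1, 0.1), seed=0):
    """Greedy multi-label stratification; returns one index list per fraction."""
    rng = random.Random(seed)
    label_sets = [set(labels) for labels in label_sets]
    n_splits = len(fractions)

    # Unassigned sample indices, grouped by label.
    by_label = defaultdict(set)
    for i, labels in enumerate(label_sets):
        for lab in labels:
            by_label[lab].add(i)

    # How many more samples each split still "wants", overall and per label.
    want_total = [f * len(label_sets) for f in fractions]
    want_label = {lab: [f * len(idx) for f in fractions]
                  for lab, idx in by_label.items()}

    splits = [[] for _ in range(n_splits)]
    unassigned = set(range(len(label_sets)))

    def place(i, lab=None):
        # Prefer the split that still wants the most samples of `lab`, then
        # the split with the most remaining room; other ties are random.
        def need(j):
            per_label = want_label[lab][j] if lab is not None else 0.0
            return (per_label, want_total[j], rng.random())

        j = max(range(n_splits), key=need)
        splits[j].append(i)
        unassigned.discard(i)
        want_total[j] -= 1
        for other in label_sets[i]:
            want_label[other][j] -= 1
            by_label[other].discard(i)

    # Handle the rarest remaining label first, so scarce diseases are
    # spread out before the common ones fill the splits.
    while any(by_label.values()):
        lab = min((l for l in by_label if by_label[l]),
                  key=lambda l: (len(by_label[l]), l))
        pool = sorted(by_label[lab])
        rng.shuffle(pool)  # avoid bias from the original row order
        for i in pool:
            place(i, lab)

    # Samples without any label are balanced on split size alone.
    for i in sorted(unassigned):
        place(i)
    return splits


# Hypothetical usage with the dataframe from the question (called `df` here):
# label_sets = [parse_labels(cell) for cell in df["Finding_Labels"]]
# train_idx, val_idx, test_idx = stratified_multilabel_split(label_sets)
```

The three index lists could then be wrapped with `torch.utils.data.Subset(dataset, indices)` to get PyTorch datasets. No split can be proportional for the rarest classes: 'Hernia' has a single image, so it can appear in only one set, and 'Fibrosis' with 25 images leaves only about 2 or 3 for each of validation and test.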