
[Dataframe image] I am trying to split the above dataframe into train (80%), validation (10%), and test (10%) sets; however, I want to maintain an almost equal proportion of each disease in every set. The Finding_Labels column holds the list of diseases an image is linked to.

Each image may be linked to more than one disease, as seen in the first row, which makes this a little problematic. How can I split the data so that the three sets have an almost equal distribution of each disease?

An answer in PyTorch would be appreciated.

Disease names and counts:

{'Atelectasis': 391,
 'Infiltration': 1181,
 'No Finding': 3479,
 'Emphysema': 116,
 'Pneumonia': 116,
 'Pleural_Thickening': 130,
 'Pneumothorax': 361,
 'Mass': 241,
 'Nodule': 190,
 'Consolidation': 299,
 'Edema': 124,
 'Cardiomegaly': 157,
 'Effusion': 506,
 'Fibrosis': 25,
 'Hernia': 1}
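For reference, per-disease counts like the dictionary above can be computed with pandas, assuming a dataframe whose Finding_Labels column holds lists of diseases per image (the column name follows the question; the sample rows below are made up):

```python
import pandas as pd

# Hypothetical dataframe mirroring the one in the question: each image row
# carries a list of diseases in its Finding_Labels column.
df = pd.DataFrame({
    "Image": ["img1.png", "img2.png", "img3.png"],
    "Finding_Labels": [["Atelectasis", "Effusion"], ["No Finding"], ["Effusion"]],
})

# explode() turns each list entry into its own row; value_counts() then
# tallies how often each disease occurs across all images.
counts = df["Finding_Labels"].explode().value_counts().to_dict()
print(counts)  # {'Effusion': 2, 'Atelectasis': 1, 'No Finding': 1}
```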
desertnaut
Shloakr
  • Is there a specific reason you want to use `PyTorch` to split your data? I would use the splitting method of [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). You can use the argument `stratify` to maintain an equal label ratio between testing and training data. – ko3 Jul 01 '22 at 07:04
  • @ko3 no, actually. My team uses PyTorch so I said that. However, how will I use stratify here? – Shloakr Jul 01 '22 at 07:08
  • If you are allowed and willing to use `sklearn`, have a look at [this](https://stackoverflow.com/questions/29438265/stratified-train-test-split-in-scikit-learn#:~:text=As%20such%2C%20it%20is%20desirable%20to%20split%20the,to%20the%20y%20component%20of%20the%20original%20dataset.) post. – ko3 Jul 01 '22 at 07:14

1 Answer


As mentioned in the comments, I would use sklearn to solve your problem.

Suppose, for instance, you have the following feature and label arrays, where the number of samples with label 1, `(y == 1).sum()`, equals the number with label 0, `(y == 0).sum()` (i.e. a one-to-zero ratio of 1.0):

import numpy as np
from sklearn import datasets

# Two well-separated clusters, 5000 samples per class.
X, y = datasets.make_blobs(n_samples=10000, centers=2, random_state=0)

print(np.bincount(y)[0] / np.bincount(y)[1])  # 1.0

Then you can split your data into training, validation, and testing sets while maintaining the same label ratio, like this:

from sklearn.model_selection import train_test_split

# First carve off 20% of the data, then split that portion half-and-half
# into validation and test sets, stratifying on the labels each time.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.2)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, stratify=y_test, random_state=42, test_size=0.5)

print(np.bincount(y_train)[0] / np.bincount(y_train)[1]) # 1.0
print(np.bincount(y_test)[0] / np.bincount(y_test)[1])   # 1.0
print(np.bincount(y_val)[0] / np.bincount(y_val)[1])     # 1.0
print(len(y_train), len(y_val), len(y_test))             # 8000 1000 1000
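As raised in the comments, `stratify` expects a single label per sample, so it does not directly handle multi-label data. One simple workaround (a sketch, not the only option; dedicated tools such as the iterative-stratification package implement proper multi-label stratification) is to stratify on each image's *rarest* disease, which at least spreads the infrequent diseases across the three sets. The dataframe below is made up to mirror the question's Finding_Labels column:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical multi-label data: each row lists the diseases of one image.
rng = np.random.default_rng(0)
diseases = ["No Finding", "Infiltration", "Effusion", "Atelectasis"]
labels = [list(rng.choice(diseases, size=rng.integers(1, 3), replace=False))
          for _ in range(1000)]
df = pd.DataFrame({"Finding_Labels": labels})

# Overall frequency of each disease.
counts = df["Finding_Labels"].explode().value_counts()

# Stratification key: the least frequent disease attached to each image.
strat_key = df["Finding_Labels"].apply(lambda ds: min(ds, key=counts.get))

# 80/10/10 split, stratified on that key.
train_df, rest_df = train_test_split(df, stratify=strat_key,
                                     test_size=0.2, random_state=42)
val_df, test_df = train_test_split(rest_df, stratify=strat_key.loc[rest_df.index],
                                   test_size=0.5, random_state=42)
print(len(train_df), len(val_df), len(test_df))  # 800 100 100
```

This keeps every rare disease proportionally represented as long as each key value occurs at least twice, which is exactly where a single-occurrence label like 'Hernia' breaks down.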
ko3
  • However, if you look at the dataset, each image has multiple labels, so I can't use a binary ratio like you showed. Do you know how I can do this for multi-label classification? It would be best if you could show an example using my dataset. – Shloakr Jul 01 '22 at 07:56
  • @Shloakr, the label 'Hernia' only occurs once, so it cannot be stratified. Can you cluster your labels, or collect more data? – ko3 Jul 01 '22 at 08:20
  • I can cluster it with any other label. It shouldn't matter. – Shloakr Jul 01 '22 at 10:15