
Before splitting the dataset I need to shuffle the data randomly and then do the splitting. This is my snippet for splitting the dataset, which is not random. How can I do a random split for the images and their corresponding masks in folder_mask?

folder_data = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\imagesResized\\*.png")
folder_mask = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\labelsResized\\*.png")

# split these paths using certain percentages
len_data = len(folder_data)
print("count of dataset: ", len_data)
# count of dataset:  992

split_1 = int(0.6 * len(folder_data))
split_2 = int(0.8 * len(folder_data))

#folder_data.sort()

train_image_paths = folder_data[:split_1]
print("count of train images is: ", len(train_image_paths))

valid_image_paths = folder_data[split_1:split_2]
print("count of validation images is: ", len(valid_image_paths))

test_image_paths = folder_data[split_2:]
print("count of test images is: ", len(test_image_paths))

train_mask_paths = folder_mask[:split_1]
valid_mask_paths = folder_mask[split_1:split_2]
test_mask_paths = folder_mask[split_2:]

train_dataset = CustomDataset(train_image_paths, train_mask_paths)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=1,
                                           shuffle=True, num_workers=2)

valid_dataset = CustomDataset(valid_image_paths, valid_mask_paths)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=1,
                                           shuffle=True, num_workers=2)

test_dataset = CustomDataset(test_image_paths, test_mask_paths)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1,
                                          shuffle=False, num_workers=2)

dataLoaders = {
    'train': train_loader,
    'valid': valid_loader,
    'test': test_loader,
}
AI_NA

3 Answers


As far as I understand, you want to randomize the order of the pictures so that with each rerun there are different photos in the train and test sets. Assuming you want to do this in more or less plain Python, you can do the following.

The easiest way to shuffle a list of elements in Python is:

import random
random.shuffle(my_list)  # shuffles my_list in place

But you have two lists and want to keep the link between data and masks, so you cannot shuffle each list independently. If you can accept a rather quick hack, I'd propose something like this: shuffle a shared list of indices instead.

import random

folder_data = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\imagesResized\\*.png")
folder_mask = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\labelsResized\\*.png") 

assert len(folder_data) == len(folder_mask)  # everything else would be bad

indices = list(range(len(folder_data)))
random.shuffle(indices)

Now you have a list of indices that you can split, and you can then use the indices from the split list to access the original lists.

split_1 = int(0.6 * len(folder_data))
split_2 = int(0.8 * len(folder_data))

train_image_paths = [folder_data[i] for i in indices[:split_1]]
# and so on...
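
For completeness, the same indexing applied to all six lists might look like this (a minimal sketch; folder_data, folder_mask, indices, split_1 and split_2 are as defined above):

train_image_paths = [folder_data[i] for i in indices[:split_1]]
valid_image_paths = [folder_data[i] for i in indices[split_1:split_2]]
test_image_paths = [folder_data[i] for i in indices[split_2:]]

# using the same shuffled indices keeps each image aligned with its mask
train_mask_paths = [folder_mask[i] for i in indices[:split_1]]
valid_mask_paths = [folder_mask[i] for i in indices[split_1:split_2]]
test_mask_paths = [folder_mask[i] for i in indices[split_2:]]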

This would be the plain Python way. But there are functions to do this in packages like sklearn, so you might consider using those. They'll save you a lot of work. (It's usually better to reuse code than to implement it yourself.)

Ture

Try using sklearn.model_selection.train_test_split.

from sklearn.model_selection import train_test_split 

train_image_paths, test_image_paths, train_mask_paths, test_mask_paths = \
    train_test_split(folder_data, folder_mask, test_size=0.2)

This will split your data and labels into corresponding train and test sets. train_test_split shuffles the data by default, so this also takes care of the randomization. If you need a validation set, you can call it twice: first to split into train/test, and then again on the training subset to split it into train/val, as sketched below.
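
A minimal sketch of the two-step split (the 0.6/0.2/0.2 ratios match the question; test_size=0.25 on the second call gives 0.25 × 0.8 = 0.2 of the full set for validation):

from sklearn.model_selection import train_test_split

# first split off the 20% test set
train_image_paths, test_image_paths, train_mask_paths, test_mask_paths = \
    train_test_split(folder_data, folder_mask, test_size=0.2)

# then split the remaining 80% into 60% train / 20% validation
train_image_paths, valid_image_paths, train_mask_paths, valid_mask_paths = \
    train_test_split(train_image_paths, train_mask_paths, test_size=0.25)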

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

nkaushik

If I get this right, there is one mask for each sample, right? You can use pandas to keep the data and mask paired and then split them randomly with a helper function:

import glob
import pandas as pd
import numpy as np

def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    # .ix was removed in modern pandas; .loc works since perm holds index labels
    train = df.loc[perm[:train_end]]
    validate = df.loc[perm[train_end:validate_end]]
    test = df.loc[perm[validate_end:]]
    return train, validate, test

folder_data = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\imagesResized\\*.png")
folder_mask = glob.glob("D:\\Neda\\Pytorch\\U-net\\my_data\\labelsResized\\*.png")

data_mask = pd.DataFrame({"data": folder_data, "mask": folder_mask})

train, validate, test = train_validate_test_split(data_mask)
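
From there you can pull the path lists back out of each split and feed them to the CustomDataset from the question, e.g. (a sketch, assuming the CustomDataset signature shown in the question):

# each split is a DataFrame whose "data" and "mask" columns stay aligned
train_dataset = CustomDataset(train["data"].tolist(), train["mask"].tolist())
valid_dataset = CustomDataset(validate["data"].tolist(), validate["mask"].tolist())
test_dataset = CustomDataset(test["data"].tolist(), test["mask"].tolist())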

Credit for the helper function goes to @piRSquared's answer to this question.

Hemerson Tacon