
So I have a main folder that contains sub-folders, which in turn contain the images for the dataset, as follows.

-main_db

---CLASS_1

-----img_1

-----img_2

-----img_3

-----img_4

---CLASS_2

-----img_1

-----img_2

-----img_3

-----img_4

---CLASS_3

-----img_1

-----img_2

-----img_3

-----img_4

I need to split this dataset into 2 parts, i.e. train data (70%) and test data (30%). Below is the hierarchy I want to achieve:

-main_db

---training_data

-----CLASS_1

-------img_1

-------img_2

-------img_3

-------img_4

-----CLASS_2

-------img_1

-------img_2

-------img_3

-------img_4

---testing_data

-----CLASS_1

-------img_5

-------img_6

-------img_7

-------img_8

-----CLASS_2

-------img_5

-------img_6

-------img_7

-------img_8

Any help appreciated. Thanks

I have tried this module, but it is not working for me; the module is not being imported at all.

https://github.com/jfilter/split-folders

This is exactly what I want.

Ishan Dixit
    You seem to have found a solution yourself but the tool doesn't work. Since this is a very specific question and is unlikely to aid a general audience, try filing an issue with `split-folders` if you experience problems. They are far more likely to aid you than people here! – nemo Aug 07 '19 at 12:11
  • Where are `img_5`/`img_6`/`img_7`/`img_8` coming from? – Ari Cooper-Davis Aug 07 '19 at 12:11
  • @AriCooper-Davis The same class, I believe! – Mari Aug 07 '19 at 12:15
  • @nemo You are absolutely right, and I have already opened an issue on their repo. – Ishan Dixit Aug 07 '19 at 12:47
  • Hypothetically, if I have 20 images in each of the sub-folders, then the training set folder must contain 16 images and the testing set 4 images, assuming an 80%-20% split ratio. @AriCooper-Davis – Ishan Dixit Aug 07 '19 at 12:51
  • The module `split-folders` solves this problem (I'm the author). Not sure why it wasn't working for you. – Johannes Filter Aug 04 '20 at 21:33

8 Answers


This should do it. It calculates how many images are in each class folder and then splits them accordingly, saving the test data in a different folder with the same structure. Save the code in a main.py file and run:

python3 main.py --data_path=/path1 --test_data_path_to_save=/path2 --train_ratio=0.7

import shutil
import os
import numpy as np
import argparse

def get_files_from_folder(path):
    files = os.listdir(path)
    return np.asarray(files)

def main(path_to_data, path_to_test_data, train_ratio):
    # get dirs
    _, dirs, _ = next(os.walk(path_to_data))

    # calculates how many train data per class
    data_counter_per_class = np.zeros((len(dirs)))
    for i in range(len(dirs)):
        path = os.path.join(path_to_data, dirs[i])
        files = get_files_from_folder(path)
        data_counter_per_class[i] = len(files)
    test_counter = np.round(data_counter_per_class * (1 - train_ratio))

    # transfers files
    for i in range(len(dirs)):
        path_to_original = os.path.join(path_to_data, dirs[i])
        path_to_save = os.path.join(path_to_test_data, dirs[i])

        #creates dir
        if not os.path.exists(path_to_save):
            os.makedirs(path_to_save)
        files = get_files_from_folder(path_to_original)
        # moves data
        for j in range(int(test_counter[i])):
            dst = os.path.join(path_to_save, files[j])
            src = os.path.join(path_to_original, files[j])
            shutil.move(src, dst)


def parse_args():
  parser = argparse.ArgumentParser(description="Dataset divider")
  parser.add_argument("--data_path", required=True,
    help="Path to data")
  parser.add_argument("--test_data_path_to_save", required=True,
    help="Path to test data where to save")
  parser.add_argument("--train_ratio", required=True,
    help="Train ratio - 0.7 means splitting data in 70 % train and 30 % test")
  return parser.parse_args()

if __name__ == "__main__":
  args = parse_args()
  main(args.data_path, args.test_data_path_to_save, float(args.train_ratio))
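
The script above only moves the test split; the remaining files stay in the original class folders. If you also want those leftovers gathered under a separate training folder (as in the hierarchy in the question), a small optional follow-up could look like this (a sketch reusing os, shutil and get_files_from_folder from the script; move_remaining_to_train and path_to_train_data are hypothetical names):

def move_remaining_to_train(path_to_data, path_to_train_data):
    # after the test files were moved away, everything left is training data
    _, dirs, _ = next(os.walk(path_to_data))
    for d in dirs:
        src_dir = os.path.join(path_to_data, d)
        dst_dir = os.path.join(path_to_train_data, d)
        if not os.path.exists(dst_dir):
            os.makedirs(dst_dir)
        for f in get_files_from_folder(src_dir):
            shutil.move(os.path.join(src_dir, f), os.path.join(dst_dir, f))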
lomovi

If you are not too keen on coding, there is a Python package called split-folders that you could use (it is the one linked in the question). It is extremely easy to use; here is how it can be used.

pip install split-folders

import split_folders  # or: import splitfolders (newer versions)

input_folder = "/path/to/input/folder"
output = "/path/to/output/folder"  # where the split datasets are saved; created if it does not exist

# The ratio is given in the order train/val/test; change it to whatever you want.
# For train/val sets only, you could use e.g. (.75, .25).
split_folders.ratio(input_folder, output=output, seed=42, ratio=(.8, .1, .1))

However, I strongly recommend the coding answers presented above because they help you learn.
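
For reference, the output folder should end up with roughly this layout (a sketch assuming the three-way ratio above and the class names from the question; with a two-way ratio there is no test folder):

output/
    train/
        CLASS_1/
        CLASS_2/
        CLASS_3/
    val/
        CLASS_1/
        ...
    test/
        CLASS_1/
        ...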

Malgo
Misan

Visit this link https://www.kaggle.com/questions-and-answers/102677 (credit goes to "saravanansaminathan"'s comment on Kaggle). I had the same problem on my dataset, which has the following folder structure:

/TTSplit
  /0
    /001_01.jpg
    .......
  /1
    /001_04.jpg
    .......

I solved it by taking the above link as a reference.

import os
import numpy as np
import shutil

root_dir = '/home/dipak/Desktop/TTSplit/'
classes_dir = ['0', '1']

test_ratio = 0.20

for cls in classes_dir:
    os.makedirs(root_dir + 'train/' + cls)
    os.makedirs(root_dir + 'test/' + cls)

    # shuffle the files of this class and split them 80/20
    src = root_dir + cls
    allFileNames = os.listdir(src)
    np.random.shuffle(allFileNames)
    train_FileNames, test_FileNames = np.split(np.array(allFileNames),
                                               [int(len(allFileNames) * (1 - test_ratio))])

    train_FileNames = [src + '/' + name for name in train_FileNames.tolist()]
    test_FileNames = [src + '/' + name for name in test_FileNames.tolist()]

    print("*****************************")
    print('Total images: ', len(allFileNames))
    print('Training: ', len(train_FileNames))
    print('Testing: ', len(test_FileNames))
    print("*****************************")

    # copy each file into the train/test folder of its own class
    for name in train_FileNames:
        shutil.copy(name, root_dir + 'train/' + cls)

    for name in test_FileNames:
        shutil.copy(name, root_dir + 'test/' + cls)

print("Copying Done!")
Dipendra Pant
import os
from sklearn.model_selection import train_test_split

data = os.listdir(image_directory)  # image_directory is the folder containing your images

train, valid = train_test_split(data, test_size=0.2, random_state=1)

Then you can use shutil to copy the images into your desired folders, for example:
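
A minimal sketch of that copy step (assuming image_directory from above holds the images, and using hypothetical train_dir/valid_dir destination folders):

import os
import shutil

train_dir = "train"   # hypothetical destination folders
valid_dir = "valid"
os.makedirs(train_dir, exist_ok=True)
os.makedirs(valid_dir, exist_ok=True)

# copy each file of the two split lists into its destination folder
for name in train:
    shutil.copy(os.path.join(image_directory, name), train_dir)
for name in valid:
    shutil.copy(os.path.join(image_directory, name), valid_dir)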

Mohnish
Ikram.Inf
  • Thanks, it worked for me. My dataset structure is different: it has only two folders, images and labels, with no class-wise folders inside the images folder; the images folder directly contains all the jpg files. – Kevin Patel Jul 17 '22 at 23:51

If you check their documentation (the GitHub link in the question), they have updated the syntax. Basically, I faced a similar issue, but I found the following new syntax to be working as per their update:

import splitfolders  # or: import split_folders

# Split with a ratio.
# To only split into training and validation sets, set `ratio` to a tuple, e.g. (.8, .2).
splitfolders.ratio("input_folder", output="output", seed=1337, ratio=(.8, .1, .1),
                   group_prefix=None)  # default values

# Split val/test with a fixed number of items, e.g. 100 for each set.
# To only split into training and validation sets, use a single number for `fixed`, e.g. 10.
splitfolders.fixed("input_folder", output="output", seed=1337, fixed=(100, 100),
                   oversample=False, group_prefix=None)  # default values
burhan rashid

What about this?

from pathlib import Path
import shutil

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

def image_train_test_split(path, fmt, train_size):
  train_folder = Path('train')
  test_folder = Path('test')

  train_folder.mkdir(exist_ok=True)
  test_folder.mkdir(exist_ok=True)

  # collect (file path, class name) pairs; the class is the sub-folder name
  data_path = Path(path)
  data = []
  for d in data_path.glob('*'):
    for f in d.glob(f'*.{fmt}'):
      data.append([f, d.stem])
  data = np.array(data)

  # a stratified split keeps the class proportions in both sets
  ss = StratifiedShuffleSplit(1, train_size=train_size)
  train_ix, test_ix = next(ss.split(data[:, 0], data[:, 1]))

  train_set, test_set = data[train_ix], data[test_ix]

  for p, c in train_set:
    (train_folder / c).mkdir(exist_ok=True)
    shutil.move(p, train_folder.joinpath(*p.parts[-2:]))

  for p, c in test_set:
    (test_folder / c).mkdir(exist_ok=True)
    shutil.move(p, test_folder.joinpath(*p.parts[-2:]))
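
Usage would look something like this (a hypothetical call, assuming jpg images under main_db and a 70/30 split; train/ and test/ are created in the current working directory):

image_train_test_split('main_db', 'jpg', 0.7)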
3nomis

I needed something like @Dipendra Pant's idea, but his code wasn't working for me; I think it has an indentation error in the for loop. Anyway, strongly based on his answer, here is the solution that worked for me. It reads from a folder with 5 subfolders (my 5 classes) and saves everything into 3 folders (train_ds, test_ds, val_ds), each with 5 subfolders inside, ready to use with image_dataset_from_directory with shuffle=False (the shuffling is already done in this code).

import os
import numpy as np
import shutil

# base_folder and input_destination are assumed to be defined beforehand:
# base_folder points to the project root, input_destination to the folder where
# the train_ds/test_ds/val_ds directories should be created.
root_dir = base_folder + "input/House_Room_Dataset-5_rooms/"  # folder with one sub-folder per class
classes_dir = os.listdir(root_dir)

train_ratio = 0.6
val_ratio  = 0.1

for cls in classes_dir:
    os.makedirs(input_destination +'train_ds/' + cls, exist_ok=True)
    os.makedirs(input_destination +'test_ds/' + cls, exist_ok=True)
    os.makedirs(input_destination +'val_ds/' + cls, exist_ok=True)
    
    # for each class, let's counts its elements
    src = root_dir + cls
    allFileNames = os.listdir(src)

    # shuffle it and split into train/test/val
    np.random.shuffle(allFileNames)
    train_FileNames, test_FileNames, val_FileNames = np.split(
        np.array(allFileNames),
        [int(train_ratio * len(allFileNames)), int((1 - val_ratio) * len(allFileNames))])
    
    # save their initial path
    train_FileNames = [src+'/'+ name  for name in train_FileNames.tolist()]
    test_FileNames  = [src+'/' + name for name in test_FileNames.tolist()]
    val_FileNames   = [src+'/' + name for name in val_FileNames.tolist()]
    print("\n *****************************",
          "\n Total images: ",cls, len(allFileNames),
          '\n Training: ', len(train_FileNames),
          '\n Testing: ', len(test_FileNames),
          '\n Validation: ', len(val_FileNames),
          '\n *****************************')
    
    # copy files from the initial path to the final folders
    for name in train_FileNames:
      shutil.copy(name, input_destination +'train_ds/' + cls)
    for name in test_FileNames:
      shutil.copy(name, input_destination +'test_ds/' + cls)
    for name in val_FileNames:
      shutil.copy(name, input_destination +'val_ds/' + cls)


# checking everything was fine
paths = ['train_ds/', 'test_ds/','val_ds/']
for p in paths:
  for dir,subdir,files in os.walk(input_destination + p):
    print(dir,' ', p, str(len(files)))
albertovpd

There seems to be an update in the split-folders library. This is the only code that worked perfectly on Google Colab.

!pip install split_folders
import splitfolders

input_folder = "/content/Input_Folder" #Enter Input Folder
output = "/content/Output_Folder" #Enter Output Folder

splitfolders.ratio(input_folder, output=output, seed=42, ratio=(0.8,0.2))
RiveN
NurobotX