Image preprocessing - Create a dataset for CNN

Question

I'm pretty new to CNNs and I need to build a pipeline that loads the images and gets them ready for the CNN. The thing is that I need to build a dataset made up of images. There are three classes of images: COVID-19, healthy lungs and pneumonia. The files that I have are:

1 folder containing images of lungs with COVID-19
1 folder containing images of healthy lungs
1 folder containing images of lungs with pneumonia
1 .txt file listing all the images that will form the training dataset
1 .txt file listing all the images that will form the validation dataset
1 .txt file listing all the images that will form the test dataset

I've been searching the Internet, but I can't find a way to build a dataset from all the images, let alone how to relate them to the .txt files and build the corresponding training, test and validation datasets. Any suggestions? Below is the structure of the .txt file as an example:

2   PNEUMONIA/person888_bacteria_2812.jpeg
2   PNEUMONIA/person1209_bacteria_3161.jpeg
2   PNEUMONIA/person1718_bacteria_4540.jpeg
2   PNEUMONIA/person549_bacteria_2303.jpeg
2   PNEUMONIA/person831_bacteria_2742.jpeg
2   PNEUMONIA/person1571_bacteria_4108.jpeg
2   PNEUMONIA/person1310_bacteria_3300.jpeg

You can write your own custom data generator, but if you don't need any special augmentations or anything like that, you can just use Keras' `ImageDataGenerator` class. The method `flow_from_directory` is what you are looking for (it loops over sub-directories and treats every sub-directory as a different class). [link_to_documentation](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_directory) – alivne, May 23 '20 at 18:34
This looks like a nice way to create a whole dataset of images! After doing this, I would need to create training, validation and test datasets containing the images specified in the .txt files. How can I read the dataset, link it to the .txt files and create the new datasets? – Panri93, May 23 '20 at 18:48
3 options: (1) If you want to use this class, you can use the `validation_split` argument to set the fraction of the data reserved as the validation set. (2) However, if you have already chosen the split yourself and want to use it, you can use the `flow_from_dataframe` method, but you need to create the DataFrame yourself. (3) Save the test and train images in different locations yourself (keeping the sub-directories per label), and create a separate generator for each of the data roles. – alivne, May 23 '20 at 18:54
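
A minimal sketch of option (2), assuming each line of a split file is an integer label, some whitespace, and a path relative to the image root (as in the sample above), and that the class name can be taken from the folder part of that path. The function names `parse_split_file` and `make_generator`, the 224x224 target size and the batch size are illustrative assumptions, not something given in the question. pandas and TensorFlow are imported only inside `make_generator`, so the parsing step can be tried on its own.

```python
from pathlib import PurePosixPath


def parse_split_file(txt_path):
    """Return a list of (relative_path, class_name) pairs from one split file."""
    rows = []
    with open(txt_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Each line looks like "2   PNEUMONIA/person888_bacteria_2812.jpeg".
            _label, rel_path = line.split(maxsplit=1)
            # Assumption: the folder name in the path is the class name.
            class_name = PurePosixPath(rel_path).parent.name
            rows.append((rel_path, class_name))
    return rows


def make_generator(txt_path, image_root, target_size=(224, 224), batch_size=32):
    """Build a Keras generator that yields only the images listed in txt_path."""
    # Imported here so that parse_split_file works without these packages.
    import pandas as pd
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    frame = pd.DataFrame(parse_split_file(txt_path), columns=["filename", "class"])
    datagen = ImageDataGenerator(rescale=1.0 / 255)  # scale pixels to [0, 1]
    return datagen.flow_from_dataframe(
        frame,
        directory=image_root,  # paths in the "filename" column are relative to this
        x_col="filename",
        y_col="class",
        target_size=target_size,
        class_mode="categorical",
        batch_size=batch_size,
    )
```

Calling `make_generator` once per .txt file (training, validation, test) would give one generator per split without moving any files on disk.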

score 0 · Answer 1 · answered May 24 '20 at 01:32


Is it necessary that you follow the txt files for making the train and validation sets?

If not, you could:

make a train/ directory
make a train/covid directory
make a train/healthy directory
make a train/pneumonia directory

Throw everything into the respective directories, and then randomly move a fraction of the images in each one to the corresponding validation directory.
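
A rough sketch of that random split, assuming the images already sit in one sub-directory per class under `train/`. The function name `move_random_fraction`, the 20% default and the fixed seed are illustrative choices, not part of the original suggestion.

```python
import random
import shutil
from pathlib import Path


def move_random_fraction(train_dir, val_dir, fraction=0.2, seed=0):
    """Move a random fraction of each class folder from train_dir to val_dir."""
    train_dir, val_dir = Path(train_dir), Path(val_dir)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    moved = []
    for class_dir in sorted(p for p in train_dir.iterdir() if p.is_dir()):
        images = sorted(p for p in class_dir.iterdir() if p.is_file())
        n_val = int(len(images) * fraction)  # rounds down
        target = val_dir / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for image in rng.sample(images, n_val):
            shutil.move(str(image), str(target / image.name))
            moved.append(target / image.name)
    return moved
```

Sampling per class keeps the class proportions roughly the same in both sets.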

Otherwise, you should read each txt file, pick out the specific files it lists and move them to the target folder.
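
Something along these lines could work for the txt-driven case. It is only a sketch: it assumes each line is a label, whitespace, then a path relative to the folder holding the three class folders, and it copies rather than moves so the originals stay in place. The name `copy_split` and its arguments are made up for the example.

```python
import shutil
from pathlib import Path


def copy_split(txt_path, source_root, target_root):
    """Copy every image listed in txt_path from source_root into target_root."""
    source_root, target_root = Path(source_root), Path(target_root)
    copied, missing = [], []
    with open(txt_path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split(maxsplit=1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            rel_path = parts[1].strip()
            src = source_root / rel_path
            dst = target_root / rel_path
            if not src.is_file():
                missing.append(rel_path)  # listed in the txt but not on disk
                continue
            # Recreate the class sub-directory (e.g. PNEUMONIA/) under the target.
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(rel_path)
    return copied, missing
```

Running it once per .txt file with a different `target_root` (for example `train/`, `val/` and `test/`) leaves each folder with one sub-directory per class, which is the layout `flow_from_directory` expects. Swapping `shutil.copy2` for `shutil.move` would move the files instead.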


Javier Espinoza


Yes, it is necessary. Each folder must contain the specific images listed in the .txt files. How can I read the .txt file and move the images? – Panri93 May 24 '20 at 08:14

You can read the file into a list and move the files by looping over it. This may help: https://stackoverflow.com/questions/3277503/how-to-read-a-file-line-by-line-into-a-list – Javier Espinoza May 25 '20 at 01:37
