
So I am training a CNN on a GPU which takes two images as input and returns a single value as output. Since I have a lot of images, I am using tf.keras.preprocessing.text_dataset_from_directory to feed the data in batches as a tf.data.Dataset object, which is optimized for GPUs.

So basically my input directory would be

Class_1/
  Class1_1/
     image1.png
     image2.png
  Class1_2/
     image3.png
     image4.png
...
Class_2/
  Class2_1/
     image1.png
     image2.png
  Class2_2/
     image3.png
     image4.png

The default function is only suited for the below structure

Class_1/
      image1.png
      image2.png
      image3.png
      image4.png
    ...
Class_2/
      image1.png
      image2.png
      image3.png
      image4.png

Any help would be appreciated.

1 Answer

I presume you mean image_dataset_from_directory, since you are loading images and not text data. Either way, you cannot produce batches with multiple inputs from these helper functions; as you can see from the documentation, the return shape is fixed:

A tf.data.Dataset object.

  • If label_mode is None, it yields float32 tensors of shape (batch_size, image_size[0], image_size[1], num_channels), encoding images (see below for rules regarding num_channels).
  • Otherwise, it yields a tuple (images, labels), where images has shape (batch_size, image_size[0], image_size[1], num_channels), and labels follows the format described below.

You will instead need to write your own custom generator function that yields multiple inputs loaded from your data directory, then call fit with your custom generator, passing a separate generator that produces validation data via the validation_data kwarg. (Note: in some older versions of Keras you may need fit_generator instead of fit.)

Here's an example module of helper functions that read images from a pair of directories and present them as multi-image inputs during training (assume module-level constants like INPUT1_PATH_PREFIX, INPUT2_PATH_PREFIX, BATCH_SIZE, TEST_SIZE, and RANDOM_SEED are defined elsewhere):

import os
import random
from functools import lru_cache

import numpy as np
import skimage.color
import skimage.io
import skimage.util
from sklearn.model_selection import train_test_split


def _generate_batch(training):
    in1s, in2s, labels = [], [], []
    batch_tuples = _sample_batch_of_paths(training)
    for input1_path, input2_path in batch_tuples:
        # Catch any exception so that batch loading isn't disrupted,
        # and any faulty image is simply skipped.
        try:
            in1_tmp = _load_image(
                os.path.join(INPUT1_PATH_PREFIX, input1_path),
            )
            in2_tmp = _load_image(
                os.path.join(INPUT2_PATH_PREFIX, input2_path),
            )
        except Exception as exc:
            print("Unhandled exception during image batch load. Skipping...")
            print(str(exc))
            continue
        # if no exception, both images loaded so both are added to batch.
        in1s.append(in1_tmp)
        in2s.append(in2_tmp)
        # Whatever your custom logic is to determine the label for the pair.
        labels.append(
            _label_calculation_helper(input1_path, input2_path)
        )
    in1s, in2s = map(skimage.io.concatenate_images, [in1s, in2s])
    # could also add a singleton channel dimension for grayscale images.
    # in1s = in1s[:, :, :, None]
    # Keras expects array-like targets, so convert the label list.
    return [in1s, in2s], np.asarray(labels)


def _make_generator(training=True):
    while True:
        yield _generate_batch(training)


def make_generators():
    return _make_generator(training=True), _make_generator(training=False)

The helper _load_image could be something like this:

def _load_image(path, is_gray=False):
    tmp = skimage.io.imread(path)
    if is_gray:
        # Collapse any color channels to a single grayscale channel.
        if tmp.ndim == 3:
            tmp = skimage.color.rgb2gray(tmp)
        tmp = skimage.util.img_as_float(tmp)
    else:
        if tmp.ndim == 2:
            # Promote grayscale to 3-channel RGB.
            tmp = skimage.color.gray2rgb(tmp)
        elif tmp.shape[-1] == 4:
            # Drop the alpha channel from RGBA images.
            tmp = skimage.color.rgba2rgb(tmp)
        tmp = skimage.util.img_as_float(tmp)
    # Do other stuff here - resizing, clipping, etc.
    return tmp
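
The "other stuff" could include a resize to your model's fixed input size. A minimal sketch, assuming a hypothetical IMAGE_SIZE constant:

import skimage.transform

# Hypothetical target size; match whatever your model's input layer expects.
IMAGE_SIZE = (224, 224)

def _resize_image(img):
    # anti_aliasing smooths before downsampling to avoid aliasing artifacts.
    return skimage.transform.resize(img, IMAGE_SIZE, anti_aliasing=True)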

and the helper function that samples a batch from the sets of paths listed off disk could look like this:

@lru_cache(1)
def _load_and_split_input_paths():
    training_in1s, testing_in1s = train_test_split(
        os.listdir(INPUT1_PATH_PREFIX),
        test_size=TEST_SIZE,
        random_state=RANDOM_SEED
    )
    training_in2s, testing_in2s = train_test_split(
        os.listdir(INPUT2_PATH_PREFIX),
        test_size=TEST_SIZE,
        random_state=RANDOM_SEED
    )
    return training_in1s, testing_in1s, training_in2s, testing_in2s


def _sample_batch_of_paths(training):
    training_in1s, testing_in1s, training_in2s, testing_in2s = _load_and_split_input_paths()
    if training:
        return list(zip(
            random.sample(training_in1s, BATCH_SIZE),
            random.sample(training_in2s, BATCH_SIZE)
        ))
    else:
        return list(zip(
            random.sample(testing_in1s, BATCH_SIZE),
            random.sample(testing_in2s, BATCH_SIZE)
        ))

This would randomly sample images from some "input 1" directory and pair them with random samples from an "input 2" directory. Obviously, in your use case you'll want to change this so that the data are pulled deterministically, according to the file structure that defines their pairings and labelings; one way to do that is sketched below.
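
As a rough illustration only, here is a deterministic listing for a nested layout like yours. The pairing rule here, that each second-level subdirectory holds exactly two paired images labeled by the top-level class, is an assumption, and root_dir is whatever directory holds Class_1/, Class_2/, and so on:

def _list_pairs_deterministically(root_dir):
    pairs = []
    for class_name in sorted(os.listdir(root_dir)):
        class_dir = os.path.join(root_dir, class_name)
        for pair_name in sorted(os.listdir(class_dir)):
            pair_dir = os.path.join(class_dir, pair_name)
            # Assumed: exactly two paired images per subdirectory.
            img1, img2 = sorted(os.listdir(pair_dir))[:2]
            pairs.append((
                os.path.join(pair_dir, img1),
                os.path.join(pair_dir, img2),
                class_name,
            ))
    return pairs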

Finally, once you want to use this, you can call training code such as:

training_generator, testing_generator = make_generators()
try:
    some_compiled_model.fit(
        training_generator,
        epochs=EPOCHS,
        validation_data=testing_generator,
        callbacks=[...],
        verbose=VERBOSE,
        steps_per_epoch=STEPS_PER_EPOCH,
        validation_steps=VALIDATION_STEPS,
    )
except KeyboardInterrupt:
    pass
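
Here, some_compiled_model is any compiled two-input Keras model. A minimal hypothetical sketch with the functional API (the layer sizes and loss are placeholders):

import tensorflow as tf

def build_model(input_shape=(224, 224, 3)):
    # Two image inputs passed through a shared convolutional backbone,
    # merged, and reduced to a single output value.
    in1 = tf.keras.Input(shape=input_shape)
    in2 = tf.keras.Input(shape=input_shape)
    backbone = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
    ])
    merged = tf.keras.layers.concatenate([backbone(in1), backbone(in2)])
    out = tf.keras.layers.Dense(1)(merged)
    model = tf.keras.Model(inputs=[in1, in2], outputs=out)
    model.compile(optimizer="adam", loss="mse")
    return model

Note also that in TF 2.x, fit accepts max_queue_size, workers, and use_multiprocessing kwargs to control how many batches are prefetched from the generator while the GPU is busy.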
  • Isn't `fit_generator` deprecated? – Nicolas Gervais Dec 16 '20 at 16:44
  • Sorry, yes, now `fit` can take a generator. I'll update. – ely Dec 16 '20 at 16:46
  • @ely My main concern keeping me away from generators was their performance. In an ideal case, I would like the next batch to be processed on the CPU while the GPU is working on the previous batch. Is that possible with a plain Python generator? – Anandha Krishnan H Dec 17 '20 at 03:23
  • That is already how Keras lets you handle batch generators: it can pre-fetch a queue of batches while the GPU is working (this has nothing to do with generators vs. other loading approaches). One of the main benefits of generators is performance and reduced memory overhead. It's often superior to tf Datasets because you have so much more control over the loading and preprocessing logic, yet you pay no worse a penalty in load time or memory overhead. – ely Dec 17 '20 at 19:50
  • See [here](https://stackoverflow.com/a/56251858/567620) (and the comment below it) for some of these config options. – ely Dec 17 '20 at 19:51
  • @ely Thank you for your detailed reply. With your answer and some other resources, I made a simple generator, but I think it is not performing as I expected. For a sample dataset, loading all the data at once was taking around 100 sec per epoch, but while using my generator it is about twice that, at 220 sec per epoch. When I used the same generator on my full dataset (~3 million images), it was taking around 10 hours per epoch. Could my generator be at fault, or could it be something else? – Anandha Krishnan H Dec 21 '20 at 09:02
  • Can you add more detail about what you mean by "loading all data at once"? Typically with generators, you would only load metadata all at once (like 3 million file paths, not 3 million images), and then the batch generator will sample a batch from that metadata and only do the real data loading on-demand, at the time the generator is asked to produce the next batch. If you are trying to load a huge number of batches of data, you might be hitting some type of swap limit or other memory issue that is blocking keras. In general, 100 seconds per epoch sounds like something was (and still is) off. – ely Dec 22 '20 at 16:17