
I'm trying to apply data augmentation to a dataset. I use the following code:

train_generator = keras.utils.image_dataset_from_directory(
    directory=train_dir,
    subset="training",
    image_size=(50, 50),
    batch_size=32,
    validation_split=0.3,
    seed=1337,
    labels="inferred",
    label_mode="binary",
)

validation_generator = keras.utils.image_dataset_from_directory(
    directory=validation_dir,
    subset="validation",
    image_size=(50, 50),
    batch_size=40,
    validation_split=0.3,
    seed=1337,
    labels="inferred",
    label_mode="binary",
)

data_augmentation = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),
    keras.layers.RandomRotation(0.1),
    keras.layers.RandomZoom(0.1),
])

train_dataset = train_generator.map(lambda x, y: (data_augmentation(x, training=True), y))

But when I try to run the training process using this method, I get an "insufficient data" warning:

6/100 [>.............................] - ETA: 21s - loss: 0.7602 - accuracy: 0.5200WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 2000 batches). You may need to use the repeat() function when building your dataset.
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 10 batches). You may need to use the repeat() function when building your dataset.

Yes, the original dataset is insufficient on its own, but the data augmentation should provide more than enough data for training. Does anyone know what's going on?

EDIT:

fit call:

history = model.fit(
    train_dataset,
    epochs=20,
    steps_per_epoch=100,
    validation_data=validation_generator,
    validation_steps=10,
    callbacks=callbacks_list,
)

This is the version I had using ImageDataGenerator:

train_datagen = keras.preprocessing.image.ImageDataGenerator(
    rescale=1 / 255, rotation_range=40, width_shift_range=0.2,
    height_shift_range=0.2, shear_range=0.2, zoom_range=0.2,
    horizontal_flip=True,
)

train_generator = train_datagen.flow_from_directory(
    directory=train_dir, target_size=(50, 50), batch_size=32, class_mode="binary"
)

val_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1 / 255)
validation_generator = val_datagen.flow_from_directory(
    directory=validation_dir, target_size=(50, 50), batch_size=40, class_mode="binary"
)

This specific code (with the same number of epochs, steps_per_epoch, and batch size) was taken from the book Deep Learning with Python by François Chollet; it's an example of a data augmentation setup on page 141. As you may have guessed, it produces the same results as the other method shown above.

Gevezo
  • The message does not mean what you think it means; also, you did not share the actual fit call. – Dr. Snoopy Jul 27 '22 at 21:07
  • What does it mean? Also, I have edited to include the fit call. – Gevezo Jul 27 '22 at 22:00
  • It means the generator ran out of data, not that your dataset is too small; the value of steps_per_epoch or validation_steps is probably wrong. You assume new images are created, but that is not how a generator with data augmentation works. – Dr. Snoopy Jul 28 '22 at 00:22
  • How does it work then? I know for a fact that the generator produces new images, because that's the concept behind data augmentation and it's what the documentation points to. But I don't understand why it can't produce enough images to sustain the 100 steps of the 20 epochs, even though it can produce enough images to sustain 8 steps of 335 epochs. – Gevezo Aug 03 '22 at 14:12
  • There are multiple problems, and none is about "producing enough images". You should be using ImageDataGenerator, which does data loading from a folder and data augmentation at the same time, and when using it there is no need to specify steps_per_epoch or validation_steps. You get the error because the value of steps_per_epoch is not correct. A generator does not really create new images; during training the model only sees the augmented images (not the original data). – Dr. Snoopy Aug 03 '22 at 14:18
  • The TensorFlow documentation says ImageDataGenerator is a soon-to-be-deprecated method and suggests this one instead. But I was working with ImageDataGenerator prior to this. I also tried not defining steps_per_epoch, but then it just sets the steps to a very low number. Now, if the model only sees the augmented images, that means the generator produced those images for training, right? But if it can produce them, why does it limit itself based on the number of original images? – Gevezo Aug 03 '22 at 14:26
  • I have also noticed that this limitation applies per epoch; the generator can produce as many images as needed, but not all at once (in a single epoch, for example). Is this to avoid overfitting? – Gevezo Aug 03 '22 at 14:28
  • I can also post the previous code from when I was working with ImageDataGenerator, if you wish. – Gevezo Aug 03 '22 at 14:29
  • Nah, in the end the problem is a misunderstanding of steps_per_epoch: if you want more images, you increase the number of epochs, you don't fiddle with steps_per_epoch. – Dr. Snoopy Aug 03 '22 at 14:29
  • Wouldn't that be the same anyway? I ask because a coworker managed to reproduce this code in R, and I was supposed to get it working in Python with the same set of parameters for research purposes. – Gevezo Aug 03 '22 at 14:30
  • Keras does not have a native R version; it goes through a binding that nobody knows the workings of, so that information is kind of useless. – Dr. Snoopy Aug 03 '22 at 14:48

1 Answer


When we say that data augmentation increases the number of instances, we usually mean that an altered version of a sample is created for the model to process. It's just image preprocessing with randomness.
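
To see this concretely, here is a minimal, self-contained sketch (using synthetic tensors rather than the directories from the question) showing that mapping an augmentation pipeline over a dataset changes the pixels but not the number of batches:

import tensorflow as tf
from tensorflow import keras

data_augmentation = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),
    keras.layers.RandomRotation(0.1),
])

# 100 fake 50x50 RGB images, batched by 32 -> ceil(100 / 32) = 4 batches
images = tf.random.uniform((100, 50, 50, 3))
labels = tf.zeros((100, 1))
ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32)

# Mapping the augmentation alters the pixels on the fly...
aug_ds = ds.map(lambda x, y: (data_augmentation(x, training=True), y))

# ...but the batch count is unchanged: prints "4 4"
print(ds.cardinality().numpy(), aug_ds.cardinality().numpy())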

If you inspect your training log closely, you will find the solution there, shown below. The main issue with your approach is discussed in this post.

WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 2000 batches). You may need to use the repeat() function when building your dataset.

So, to solve this, we can use the .repeat() function. To understand what it does, you can check this answer. Here is sample code that should work for you.

train_ds = keras.utils.image_dataset_from_directory(
    ...
)
train_ds = train_ds.map(
    lambda x, y: (data_augmentation(x, training=True), y)
)
val_ds = keras.utils.image_dataset_from_directory(
    ...
)

# Read the number of batches *before* repeating; a repeated
# dataset reports an infinite cardinality.
steps_per_epoch = train_ds.cardinality().numpy()
validation_steps = val_ds.cardinality().numpy()

# Using the .repeat() function so the stream can serve
# steps_per_epoch * epochs batches; batch_size is the value
# passed to image_dataset_from_directory above.
train_ds = train_ds.repeat().shuffle(8 * batch_size)
train_ds = train_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

val_ds = val_ds.repeat()
val_ds = val_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

# specify steps per epoch explicitly, since the repeated
# datasets are now effectively infinite
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=...,
    steps_per_epoch=steps_per_epoch,
    validation_steps=validation_steps,
)
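
Note that the step counts are read before .repeat() is applied: once a dataset has been repeated, cardinality() returns tf.data.INFINITE_CARDINALITY, so it can no longer tell fit how many batches make up one epoch. With this setup, each epoch is one full pass over the original data (with fresh random augmentations each time), and the repeated stream never runs out. Alternatively, if you drop .repeat() and simply omit steps_per_epoch and validation_steps, fit treats one pass over the dataset as one epoch and the warning goes away:

history = model.fit(train_ds, validation_data=val_ds, epochs=...)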
Innat
  • This actually works, but I'd like to leave a warning for people, since I tried repeat() before reading this answer: using repeat() alone will lead to overfitting; to make it work properly, you must apply it as this gentleman suggested. Anyway, thank you. – Gevezo Aug 05 '22 at 23:33