I wrote a custom data generator that yields a tuple of [input, target] on each call, and I set it up to shuffle the training samples every epoch. I then pass this generator to fit_generator, but I am confused by the "shuffle" argument of that function:

fit_generator(self, generator, steps_per_epoch=None, epochs=1, verbose=1, callbacks=None, validation_data=None, validation_steps=None, class_weight=None, max_queue_size=10, workers=1, use_multiprocessing=False, shuffle=True, initial_epoch=0)

From Keras API:

shuffle: Whether to shuffle the order of the batches at the beginning of each epoch. Only used with instances of Sequence (keras.utils.Sequence)

I thought shuffling should be the job of the generator. How can fit_generator shuffle the order of the batches when my custom generator decides which batch to output in each iteration?

nbro
Tu Bui

1 Answer

As the documentation you quoted says, the shuffle argument is only relevant for a generator that implements keras.utils.Sequence.

If you are using a "simple" generator (such as keras.preprocessing.image.ImageDataGenerator, or your own custom non-Sequence generator), then that generator implements a method that returns a single batch at a time (using yield - you can learn more about it in this question). Therefore, only the generator itself controls which batch is returned.
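To make that concrete, here is a minimal sketch of such a plain generator. The name batch_generator and its parameters are made up for illustration, and it uses Python lists instead of NumPy arrays only so that it runs without extra dependencies. The point is that the shuffling happens inside the generator, where fit_generator cannot see or change it.

```python
import random


def batch_generator(inputs, targets, batch_size, seed=None):
    """Yield (input_batch, target_batch) tuples forever, reshuffling every epoch.

    Hypothetical helper for illustration; not part of Keras.
    """
    rng = random.Random(seed)
    indices = list(range(len(inputs)))
    while True:  # fit_generator expects the generator to loop indefinitely
        rng.shuffle(indices)  # the generator alone decides the sample order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            yield [inputs[i] for i in batch], [targets[i] for i in batch]
```

Each call to next() on this generator just returns whatever batch comes next, so there is no batch index that the caller could reorder.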

keras.utils.Sequence was introduced to support multiprocessing:

Sequence are a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch which is not the case with generators.

To that end, you need to implement a method that returns a batch given its batch index (which allows multiple workers to be synchronized): __getitem__(self, idx). If you enable the shuffle argument, the __getitem__ method will be invoked with the indexes in a random order.
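As a rough sketch of what this looks like, the class below implements __len__ and __getitem__ on top of Python lists. ListSequence and simulated_epoch are hypothetical names; simulated_epoch is not Keras code, it only imitates the order in which batch indexes would be requested with and without shuffling. The fallback to object is there only so the sketch runs when Keras is not installed.

```python
import math
import random

try:
    from keras.utils import Sequence  # the real base class, if Keras is installed
except Exception:
    Sequence = object  # stand-in so this sketch still runs without Keras


class ListSequence(Sequence):
    """Hypothetical Sequence that serves fixed batches by batch index."""

    def __init__(self, inputs, targets, batch_size):
        super().__init__()
        self.inputs = inputs
        self.targets = targets
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch (the last batch may be smaller).
        return math.ceil(len(self.inputs) / self.batch_size)

    def __getitem__(self, idx):
        # Batch idx always contains the same samples.
        lo = idx * self.batch_size
        hi = lo + self.batch_size
        return self.inputs[lo:hi], self.targets[lo:hi]


def simulated_epoch(seq, shuffle, rng):
    """Imitation of one epoch: request every batch index once, optionally shuffled."""
    order = list(range(len(seq)))
    if shuffle:
        rng.shuffle(order)
    return [seq[i] for i in order]
```

Note that in this sketch shuffling only changes the order in which the batches are visited; the contents of each batch stay the same.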

However, you may also set it to False and do the shuffling yourself by implementing the on_epoch_end method.
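A minimal sketch of that approach, assuming list-based data and a made-up class name ShufflingSequence: the Sequence keeps its own permutation of sample indexes and reshuffles it in on_epoch_end. Calling on_epoch_end from __init__ is my own choice so that the first epoch is shuffled as well, and the fallback to object only exists so the sketch runs without Keras.

```python
import math
import random

try:
    from keras.utils import Sequence  # the real base class, if Keras is installed
except Exception:
    Sequence = object  # stand-in so this sketch still runs without Keras


class ShufflingSequence(Sequence):
    """Hypothetical Sequence that reshuffles its samples after every epoch."""

    def __init__(self, inputs, targets, batch_size, seed=None):
        super().__init__()
        self.inputs = inputs
        self.targets = targets
        self.batch_size = batch_size
        self.rng = random.Random(seed)
        self.indices = list(range(len(inputs)))
        self.on_epoch_end()  # assumption: shuffle before the first epoch too

    def __len__(self):
        return math.ceil(len(self.inputs) / self.batch_size)

    def __getitem__(self, idx):
        # Look up the samples for batch idx through the current permutation.
        lo = idx * self.batch_size
        batch = self.indices[lo:lo + self.batch_size]
        return [self.inputs[i] for i in batch], [self.targets[i] for i in batch]

    def on_epoch_end(self):
        # Called at the end of each epoch: draw a new sample order.
        self.rng.shuffle(self.indices)
```

Unlike reordering batch indexes, this variant mixes samples across batches, so the composition of each batch changes from epoch to epoch while every sample is still used once per epoch.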

Mark Loyman
  • So what happens if I use my own custom non-Sequence generator and set shuffle=True in fit_generator? – Tu Bui Apr 15 '18 at 13:38
  • Nothing happens. If you look at the source: https://github.com/keras-team/keras/blob/3b444513b52cf05e7d40f2ffdb7ab7283bb2ce06/keras/engine/training.py#L2168, the argument is only used when your generator is a Sequence. – Mark Loyman Apr 15 '18 at 15:33
  • In the method __getitem__(...), is there a way to know which worker (thread) id is grabbing that particular batch (identified by "idx")? The motivation for asking is that I want to spread the workload of constructing a separate dataset (e.g. negative samples) across 2 workers. Ideally, this should be done in on_epoch_end, but that probably won't be run by multiple processes? – kawingkelvin Nov 15 '19 at 20:08