17

I originally tried to use generator syntax when writing a custom generator for training a Keras model, so I put a yield inside __next__. However, when I tried to train my model with model.fit_generator, I got an error saying that my generator was not an iterator. The fix was to change yield to return, which also meant reworking the logic of __next__ to track state. That is quite cumbersome compared to letting yield do the work for me.

Is there a way I can make this work with yield? I will need to write several more iterators that will have to have very clunky logic if I have to use a return statement.
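
One likely cause, sketched here under assumptions since the original class was not posted: a `yield` inside `__next__` turns `__next__` itself into a generator function, so every call returns a fresh generator object instead of a batch. Moving the `yield` into `__iter__` keeps the generator syntax, and `iter(obj)` then returns a real generator that can be handed to the training call. The class names `BrokenBatches` and `Batches` are hypothetical, and plain lists stand in for arrays.

```python
import types


class BrokenBatches:
    """Hypothetical class that puts `yield` inside __next__."""

    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size

    def __iter__(self):
        return self

    def __next__(self):
        # Because of the `yield`, calling __next__ does not run this body;
        # it returns a new generator object every time.
        yield self.data[:self.batch_size]


class Batches:
    """Hypothetical class that puts `yield` inside __iter__ instead."""

    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size

    def __iter__(self):
        # Loop forever so a training loop can keep asking for batches.
        while True:
            for start in range(0, len(self.data), self.batch_size):
                yield self.data[start:start + self.batch_size]


broken_item = next(BrokenBatches([1, 2, 3, 4, 5], 2))
print(isinstance(broken_item, types.GeneratorType))  # a generator, not a batch

batches = iter(Batches([1, 2, 3, 4, 5], 2))  # a genuine generator object
first_four = [next(batches) for _ in range(4)]
print(first_four)
```

With this layout the state tracking (the current position in the data) lives in the local variables of `__iter__`, so no hand-written bookkeeping in `__next__` is needed.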

doogFromMT

4 Answers

26

I can't help debug your code since you didn't post it, but I abbreviated a custom data generator I wrote for a semantic segmentation project for you to use as a template:

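import os
import random

import cv2
import numpy as np

INPUT_SHAPE = (256, 256)  # assumed (width, height) for cv2.resize; set to your model's input size


def generate_data(directory, batch_size):
    """Replaces Keras' native ImageDataGenerator."""
    i = 0
    file_list = os.listdir(directory)
    while True:
        image_batch = []
        for b in range(batch_size):
            if i == len(file_list):
                # Finished one pass over the files: reshuffle and start over.
                i = 0
                random.shuffle(file_list)
            sample = file_list[i]
            i += 1
            # os.listdir returns bare file names, so join them with the directory.
            image = cv2.resize(cv2.imread(os.path.join(directory, sample)), INPUT_SHAPE)
            # Scale pixel values from [0, 255] to roughly [-1, 1].
            image_batch.append((image.astype(float) - 128) / 128)

        yield np.array(image_batch)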

Usage:

data_dir = os.path.expanduser('~/my_data')  # os.listdir does not expand '~' itself
model.fit_generator(
    generate_data(data_dir, batch_size),
    steps_per_epoch=len(os.listdir(data_dir)) // batch_size)
Jessica Alan
  • Thanks for this. I was trying to do it by passing an instance of a class with a `yield` statement inside the `__next__` method of the class. Your way points out another route, so I'll give this a try. – doogFromMT Sep 30 '17 at 15:55
  • @Jessica Alan When will the `while True:` loop stop? – N.IT Nov 08 '18 at 20:19
  • @N.IT I recommend researching Python generators. In a nutshell, the `yield` statement causes the function to "pause" until the next value is requested. The loop ends when whatever is consuming the generator (`model.fit_generator()` in the example) stops requesting values. – Jessica Alan Dec 28 '18 at 17:04
  • 2
    Where do you specify the labels? – Ben Jones May 03 '19 at 19:17
  • I created a label_batch the same way I created image_batch, then `yield (np.array(image_batch), np.array(label_batch))`. – Jessica Alan May 06 '19 at 18:26
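
The two comments above (how the loop pauses, and where the labels go) can be illustrated together with a minimal sketch. It assumes the samples are already in memory as `(image, label)` pairs, with strings and integers standing in for real images and labels, and the name `generate_pairs` is hypothetical. Each `next()` call runs the body up to the next `yield` and then pauses, so the `while True:` loop only advances when the caller asks for another batch.

```python
import random


def generate_pairs(samples, batch_size, seed=0):
    """Yield (image_batch, label_batch) tuples forever."""
    rng = random.Random(seed)  # local RNG so the shuffle is reproducible
    samples = list(samples)
    i = 0
    while True:
        image_batch, label_batch = [], []
        for _ in range(batch_size):
            if i == len(samples):
                # One full pass is done: start over in a new order.
                i = 0
                rng.shuffle(samples)
            image, label = samples[i]
            i += 1
            image_batch.append(image)
            label_batch.append(label)
        # With real data, wrap both lists in np.array(...) before yielding.
        yield image_batch, label_batch


toy_samples = [("img0", 0), ("img1", 1), ("img2", 0), ("img3", 1)]
gen = generate_pairs(toy_samples, batch_size=2)

first = next(gen)   # runs until the first yield, then pauses
second = next(gen)  # resumes after the yield and builds the next batch
print(first)
print(second)
```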
21

I have recently played with generators for Keras and finally managed to prepare an example. It uses random data, so training a neural network on it makes no sense, but it's a good illustration of using a Python generator with Keras.

Generate some data

import numpy as np
import pandas as pd
data = np.random.rand(200,2)
expected = np.random.randint(2, size=200).reshape(-1,1)

dataFrame = pd.DataFrame(data, columns = ['a','b'])
expectedFrame = pd.DataFrame(expected, columns = ['expected'])

dataFrameTrain, dataFrameTest = dataFrame[:100],dataFrame[-100:]
expectedFrameTrain, expectedFrameTest = expectedFrame[:100],expectedFrame[-100:]

Generator

def generator(X_data, y_data, batch_size):

    samples_per_epoch = X_data.shape[0]
    number_of_batches = samples_per_epoch / batch_size
    counter = 0

    while True:

        X_batch = np.array(X_data[batch_size*counter:batch_size*(counter+1)]).astype('float32')
        y_batch = np.array(y_data[batch_size*counter:batch_size*(counter+1)]).astype('float32')
        counter += 1
        yield X_batch, y_batch

        # restart the counter to yield data in the next epoch as well
        if counter >= number_of_batches:
            counter = 0
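
To see how the counter wraps around at the end of an epoch, here is a small sketch of the same slicing logic. It uses plain lists instead of DataFrames so that it runs without extra packages; the name `batch_generator` and the toy data are assumptions for illustration.

```python
def batch_generator(x_data, y_data, batch_size):
    """Yield consecutive (x, y) slices forever, restarting after the last one."""
    number_of_batches = len(x_data) / batch_size  # may be fractional
    counter = 0
    while True:
        start = batch_size * counter
        stop = batch_size * (counter + 1)
        counter += 1
        yield x_data[start:stop], y_data[start:stop]
        # Restart so the next epoch begins at the first sample again.
        if counter >= number_of_batches:
            counter = 0


x = list(range(10))
y = [value % 2 for value in x]
gen = batch_generator(x, y, batch_size=4)

batches = [next(gen) for _ in range(4)]
for x_batch, y_batch in batches:
    print(x_batch, y_batch)
```

The third batch is shorter than the others because 10 is not a multiple of 4, and the fourth batch starts again from the first sample.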

Keras model

from keras.models import Sequential
from keras.layers import Dense


model = Sequential()
model.add(Dense(12, activation='relu', input_dim=dataFrame.shape[1]))
model.add(Dense(1, activation='sigmoid'))


model.compile(loss='binary_crossentropy', optimizer='adadelta', metrics=['accuracy'])

#Train the model using generator vs using the full batch
batch_size = 8

model.fit_generator(
    generator(dataFrameTrain, expectedFrameTrain, batch_size),
    epochs=3,
    steps_per_epoch=dataFrame.shape[0] // batch_size,
    validation_data=generator(dataFrameTest, expectedFrameTest, batch_size*2),
    validation_steps=dataFrameTest.shape[0] // (batch_size*2)
)

#without generator
#model.fit(
#    x = np.array(dataFrame),
#    y = np.array(expected),
#    batch_size = batch_size,
#    epochs = 3
#)

Output

Epoch 1/3
25/25 [==============================] - 3s - loss: 0.7297 - acc: 0.4750 - val_loss: 0.7183 - val_acc: 0.5000
Epoch 2/3
25/25 [==============================] - 0s - loss: 0.7213 - acc: 0.3750 - val_loss: 0.7117 - val_acc: 0.5000
Epoch 3/3
25/25 [==============================] - 0s - loss: 0.7132 - acc: 0.3750 - val_loss: 0.7065 - val_acc: 0.5000
Caesar
Vaasha
  • 3
    That `model.fit_generator` line was painful to read, please consider adding some carriage return when you write oneliners like this one – Arthur Attout Feb 04 '19 at 21:53
  • 1
    It's supposed to be `validation_steps = dataFrameTest.shape[0]/batch_size*2`. Also, `fit_generator()` is deprecated in TensorFlow (I think since v.2.0) and you should pass the generator to `model.fit()` instead – haimco May 26 '20 at 18:12
  • Get `tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled` error in Tensorflow 2.1.0. – Ynjxsjmh May 25 '21 at 04:36
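
A related detail in the step-count expression is operator precedence: `n / batch_size * 2` divides first and then multiplies. A quick check, assuming 100 validation rows and a validation batch size of `batch_size * 2` as in this answer (the variable names are illustrative):

```python
n_validation_rows = 100  # rows in dataFrameTest
batch_size = 8
validation_batch_size = batch_size * 2

# `/` and `*` have equal precedence and evaluate left to right,
# so this is (100 / 8) * 2, not 100 / (8 * 2).
as_written = n_validation_rows / batch_size * 2

# Number of full validation batches, as an integer step count.
full_batches = n_validation_rows // validation_batch_size

print(as_written)    # 25.0
print(full_batches)  # 6
```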
1

This is the way I implemented it for reading files of any size, and it works like a charm.

import pandas as pd

# The data files have no header row, so provide column names
# for pd.read_csv to work in chunks.
hdr = ["Col-" + str(i) for i in range(num_labels + num_features)]

def tgen(filename):
    while True:
        # Re-create the chunked reader on each pass so the generator
        # restarts from the top of the file instead of running dry.
        reader = pd.read_csv(filename, chunksize=batchz, names=hdr, header=None)
        for chunk in reader:
            W = chunk.values        # labels and features
            Y = W[:, :num_labels]   # labels
            X = W[:, num_labels:]   # features
            X = X / 255             # any required transformation
            yield X, Y

Then, back in the main script, I have:

nval=number_of_validation_samples//batchz
ntrain=number_of_training_samples//batchz
ftgen=tgen("training.csv")
fvgen=tgen("validation.csv")

history = model.fit_generator(ftgen,
                steps_per_epoch=ntrain,
                validation_data=fvgen,
                validation_steps=nval,
                epochs=number_of_epochs,
                callbacks=[checkpointer, stopper],
                verbose=2)
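
The same read-in-chunks-and-restart pattern can be tried without pandas or Keras. The sketch below uses only the standard library; the function name `chunked_rows`, the temporary file, and its contents are assumptions for illustration, and it assumes the file is not empty.

```python
import csv
import itertools
import os
import tempfile


def chunked_rows(filename, batch_size, num_labels):
    """Yield (features, labels) batches from a headerless CSV, forever."""
    while True:
        # Reopen the file on every pass so the generator never runs dry.
        with open(filename, newline="") as csvfile:
            reader = csv.reader(csvfile)
            while True:
                rows = list(itertools.islice(reader, batch_size))
                if not rows:
                    break  # end of file: start the next pass
                labels = [[float(v) for v in row[:num_labels]] for row in rows]
                features = [[float(v) / 255 for v in row[num_labels:]] for row in rows]
                yield features, labels


# Write a tiny headerless file: one label column, two feature columns.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    csv.writer(tmp).writerows([[0, 255, 0], [1, 0, 255], [0, 51, 102]])
    path = tmp.name

gen = chunked_rows(path, batch_size=2, num_labels=1)
batches = [next(gen) for _ in range(3)]  # third batch comes from the second pass
gen.close()      # closes the file held open by the paused generator
os.remove(path)
print(batches)
```

Because the file has three rows and the batch size is two, the second batch holds a single row, and the third batch repeats the first one after the file is reopened.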
0

I would like to upgrade Vaasha's code to TensorFlow 2.x to gain training efficiency as well as easier data processing. This is particularly useful for image processing.

Process the data using a generator function, as Vaasha did in the example above, or using the tf.data.Dataset API. The latter approach is very useful when processing datasets with metadata. For example, MNIST data can be loaded and processed with a few statements.

import tensorflow as tf # Ensure that TensorFlow 2.x is used
tf.compat.v1.enable_eager_execution()
import tensorflow_datasets as tfds # Needed if you are using any of the tf datasets such as MNIST, CIFAR10
mnist_train = tfds.load(name="mnist", split="train")

Use tfds.load to load the datasets. Once the data is loaded, process it as needed (for example, converting categorical variables, resizing, etc.).

Now upgrade the Keras model to TensorFlow 2.x:

model = tf.keras.Sequential()  # TensorFlow 2.x upgrade
model.add(tf.keras.layers.Dense(12, activation='relu', input_dim=dataFrame.shape[1]))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))


model.compile(loss='binary_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

# Train the model using the generator instead of the full batch
batch_size = 8

# fit_generator() is deprecated in TensorFlow 2.x; model.fit() accepts generators directly.
model.fit(generator(dataFrameTrain, expectedFrameTrain, batch_size),
          epochs=3,
          steps_per_epoch=dataFrame.shape[0] // batch_size,
          validation_data=generator(dataFrameTest, expectedFrameTest, batch_size * 2),
          validation_steps=dataFrameTest.shape[0] // (batch_size * 2))

This will upgrade the model to run in TensorFlow 2.x

Asif Patankar
Anigasan