
I have training data consisting of two multidimensional arrays [prev_sentences, current_sentences]. When I used the simple model.fit method, it gave me a memory error. I now want to use fit_generator, but I don't know how to split the training data into batches to feed into model.fit_generator. The shapes of the training arrays are (111356, 126, 1024) and (111356, 126, 1024), and the shape of y_train is (111356, 19). Here is the line of code for the simple fit method:


history=model.fit([previous_sentences, current_sentences], y_train,
                  epochs=15,batch_size=256,
                  shuffle = False, verbose = 1,
                  validation_split=0.2,
                  class_weight=custom_weight_dict,
                  callbacks=[early_stopping_cb])

I have never used fit_generator or a data generator, so I have no idea how to split this training data for use with fit_generator. Can anyone help me create batches for fit_generator?

Aizayousaf

2 Answers

You just need to call:

model.fit_generator(generator, steps_per_epoch)

where steps_per_epoch is typically ceil(num_samples / batch_size) (as per the docs) and generator is a Python generator that iterates over the data and yields it batch-wise. Each call to the generator should yield batch_size elements. An example of a generator (source):

import os
import random

import cv2
import numpy as np

def generate_data(directory, batch_size):
    """Replaces Keras' native ImageDataGenerator."""
    i = 0
    file_list = os.listdir(directory)
    while True:
        image_batch = []
        for b in range(batch_size):
            if i == len(file_list):
                i = 0
                random.shuffle(file_list)
            sample = file_list[i]
            i += 1
            # Read the file by its full path (sample[0] would be just the
            # first character of the filename); INPUT_SHAPE is defined elsewhere.
            image = cv2.resize(cv2.imread(os.path.join(directory, sample)), INPUT_SHAPE)
            image_batch.append((image.astype(float) - 128) / 128)

        yield np.array(image_batch)

Since this is absolutely problem-specific, you'll have to write your own generator, though it should be simple to do from this template.
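
For the two-input case in the question, a minimal sketch could look like the following (two_input_generator is a made-up name, and the toy shapes stand in for the real (111356, 126, 1024) arrays; the key points are the infinite while True loop, which fit_generator requires for multiple epochs, and yielding a list of two input arrays to match the two-input model):

```python
import math
import numpy as np

def two_input_generator(X1, X2, y, batch_size):
    """Yield ([X1_batch, X2_batch], y_batch) indefinitely, as fit_generator expects."""
    num_samples = len(X1)
    while True:  # loop forever so the generator survives multiple epochs
        for start in range(0, num_samples, batch_size):
            end = start + batch_size
            yield [X1[start:end], X2[start:end]], y[start:end]

# Toy data standing in for the real arrays
X1 = np.zeros((20, 4, 8))
X2 = np.zeros((20, 4, 8))
y = np.zeros((20, 3))

gen = two_input_generator(X1, X2, y, batch_size=8)
steps_per_epoch = math.ceil(len(X1) / 8)  # 3 steps: batches of 8, 8 and 4
inputs, targets = next(gen)
print(inputs[0].shape, targets.shape)  # (8, 4, 8) (8, 3)
```

Note that the last batch of each epoch is smaller than batch_size (4 here), which Keras handles fine as long as steps_per_epoch is computed with ceil.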

runDOSrun
  • Do I have to write this data_generator code for each of the training arrays, e.g. previous_sentences and current_sentences? And how will I handle the validation data that was meant to be split at run time by model.fit? – Aizayousaf Sep 11 '20 at 13:04
  • You should check the API and maybe look for some examples. [As you can see](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit_generator), there's a parameter for `validation_data`. If you have 2 inputs to the network, the generator would return both, see [here](https://stackoverflow.com/questions/49404993/keras-how-to-use-fit-generator-with-multiple-inputs/49405175) – runDOSrun Sep 11 '20 at 13:10
  • Can you check the generate_data method? And one more thing: do we also have to apply this method to the validation data, or is there no need for it? @runDOSrun – Aizayousaf Sep 11 '20 at 14:11
  • I have created a generator function that splits the training data into batches of batch_size, and I also applied the generator function to the validation data, but during training the model shows only training loss and training accuracy, neither validation loss nor validation accuracy. What is the reason for this? – Aizayousaf Sep 11 '20 at 15:40

This is the data generator that splits the training data into mini-batches:

import numpy as np

def generate_data(X1, X2, Y, batch_size):
  p_input = []
  c_input = []
  target = []
  batch_count = 0
  while True:  # loop forever: fit_generator needs the generator to yield indefinitely
    for i in range(len(X1)):
      p_input.append(X1[i])
      c_input.append(X2[i])
      target.append(Y[i])
      batch_count += 1
      if batch_count == batch_size:  # '>' would yield batch_size + 1 samples
        prev_X = np.array(p_input, dtype=np.int64)
        cur_X = np.array(c_input, dtype=np.int64)
        cur_y = np.array(target, dtype=np.int32)
        yield ([prev_X, cur_X], cur_y)
        p_input = []
        c_input = []
        target = []
        batch_count = 0

And here is the fit_generator call that replaces the model.fit method:

import math

batch_size = 256
epoch_steps = math.ceil(len(previous_sentences) / batch_size)
val_steps = math.ceil(len(val_prev) / batch_size)  # validation_steps must be defined too
hist = model.fit_generator(generate_data(previous_sentences, current_sentences, y_train, batch_size),
                           steps_per_epoch=epoch_steps,
                           callbacks=[early_stopping_cb],
                           validation_data=generate_data(val_prev, val_curr, y_val, batch_size),
                           validation_steps=val_steps,
                           class_weight=custom_weight_dict,
                           verbose=1)
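
Since fit_generator has no validation_split parameter, the validation arrays have to be carved out by hand before training. A sketch of that split plus the step counts, using toy shapes in place of the real (111356, 126, 1024) arrays (the 0.8 ratio mirrors the original validation_split=0.2; variable names are illustrative):

```python
import math
import numpy as np

# Toy stand-ins for the real arrays
previous_sentences = np.zeros((100, 6, 8))
current_sentences = np.zeros((100, 6, 8))
y_train = np.zeros((100, 3))

# Hold out the last 20% for validation, matching validation_split=0.2
split = int(len(previous_sentences) * 0.8)
val_prev, val_curr, y_val = (previous_sentences[split:],
                             current_sentences[split:],
                             y_train[split:])
previous_sentences, current_sentences, y_train = (previous_sentences[:split],
                                                  current_sentences[:split],
                                                  y_train[:split])

batch_size = 10
epoch_steps = math.ceil(len(previous_sentences) / batch_size)
val_steps = math.ceil(len(val_prev) / batch_size)
print(epoch_steps, val_steps)  # 8 2
```

Note that model.fit shuffles before splitting by default only if shuffle=True; with shuffle=False (as in the question), taking the last 20% reproduces what validation_split did.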
Aizayousaf