
I have implemented a CNN-based regression model that uses a data generator to handle the huge amount of data I have. Training and evaluation work well, but there is an issue with prediction. If, for example, I want to predict values for a test dataset of 50 samples, I call model.predict with a batch size of 5. The problem is that model.predict returns 5 values repeated 10 times, instead of 50 different values. The same thing happens if I change the batch size to 1: it returns one value repeated 50 times.

To solve this issue, I used the full dataset as the batch size (50 in my example), and it worked. But I can't use this method on my whole test dataset because it's too large.

Do you have any other solution, or what is the problem in my approach?

My data generator code:

import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, data_X, data_Z, target_y, batch_size=32, dim1=(120,120),
                 dim2=80, n_channels=1, shuffle=True):
        'Initialization'
        self.dim1 = dim1
        self.dim2 = dim2
        self.batch_size = batch_size
        self.data_X = data_X
        self.data_Z = data_Z
        self.target_y = target_y
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]

        # Generate data
        ([X, Z], y) = self.__data_generation(list_IDs_temp)

        return ([X, Z], y)

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim1, self.n_channels))
        Z = np.empty((self.batch_size, self.dim2))
        y = np.empty((self.batch_size))

        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load('data/' + self.data_X + ID + '.npy')
            Z[i,] = np.load('data/' + self.data_Z + ID + '.npy')

            # Store target
            y[i] = np.load('data/' + self.target_y + ID + '.npy')

        # Return inputs and targets in the format __getitem__ expects
        return ([X, Z], y)

How I call model.predict():

predict_params = {'data_X': 'images',
                  'data_Z': 'factors',
                  'target_y': 'True_values',
                  'batch_size': 5,
                  'dim1': (120,120),
                  'dim2': 80,
                  'n_channels': 1,
                  'shuffle': False}

# Prediction generator
prediction_generator = DataGenerator(test_index, **predict_params)

prediction_results = model.predict(prediction_generator, steps=1, verbose=1)
Derriese

3 Answers


When you use a generator, you specify a batch size, and model.predict will produce batch_size predictions per step. If you set steps=1, that is all the predictions you will get. To set the steps, take the number of samples you have and divide it by the batch size. For example, if you have 50 images and a batch size of 5, you should set steps=10. Ideally you want to go through your test set exactly once. The code below determines a batch size and step count that do exactly that. In the code, b_max is a value you select that limits the maximum batch size; set it based on your memory size to avoid an OOM (out of memory) error. The parameter length is the number of test samples you have.

length = 500   # number of test samples
b_max = 80     # largest batch size your memory can handle

# Pick the largest divisor of length that does not exceed b_max,
# so that steps * batch_size covers the test set exactly once
batch_size = sorted([int(length/n) for n in range(1, length+1)
                     if length % n == 0 and length/n <= b_max], reverse=True)[0]
steps = int(length/batch_size)

The result will be batch_size=50 and steps=10. Note that if length is a prime number, the result will be batch_size=1 and steps=length.
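
Putting this together with the generator from the question is straightforward. Below is a minimal sketch, assuming the DataGenerator class, test_index, predict_params, and model from the question are in scope, with the computed batch size swapped into the parameters:

length = len(test_index)   # e.g. 50 test samples
b_max = 80                 # adjust to your memory budget

batch_size = sorted([int(length/n) for n in range(1, length+1)
                     if length % n == 0 and length/n <= b_max], reverse=True)[0]
steps = int(length/batch_size)

# Rebuild the generator with the computed batch size
predict_params['batch_size'] = batch_size
prediction_generator = DataGenerator(test_index, **predict_params)

# One full pass over the test set: steps * batch_size == length
prediction_results = model.predict(prediction_generator, steps=steps, verbose=1)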

Gerry P
  • I've tried this, but with a batch_size of 12 I get steps=1, and I only get 12 predicted values. – Derriese Jan 26 '21 at 18:56
  • What did you use for the number of samples? If your batch size is 12 and you specify 1 step, you will get 12 predictions. – Gerry P Jan 26 '21 at 19:20
  • Yes, you're right. But if I use the code above with length=50 and b_max=1, I get batch_size=1 and steps=50, and model.predict returns one value repeated 50 times. – Derriese Jan 26 '21 at 20:35
  • That is what it should do. Are all 50 values the same? If they are, then there is something wrong with your generator. – Gerry P Jan 26 '21 at 23:05
  • Yes, they're all the same. What might be the problem in my data generator? – Derriese Jan 27 '21 at 03:57
  • To see what is going on, do data, labels = next(prediction_generator), then print out the data and labels and see if they are what you expect. – Gerry P Jan 27 '21 at 06:19
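
To act on that last suggestion: keras.utils.Sequence supports indexing, so you can pull individual batches directly and compare them. A minimal sketch, assuming prediction_generator was built with shuffle=False as in the question:

import numpy as np

# Fetch two different batches straight from the generator
# (indexing a Sequence calls its __getitem__)
[X0, Z0], y0 = prediction_generator[0]
[X1, Z1], y1 = prediction_generator[1]

# With a correct generator the batches should differ; if this prints
# True, every batch holds the same samples and the generator is broken
print(np.array_equal(X0, X1) and np.array_equal(Z0, Z1))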

If we look at your __getitem__ function, we can see this code:

        list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]

This code will always return the same IDs, because len(indexes) is always the same (at least as long as all batches contain the same number of samples), so you just loop over the first few entries of list_IDs every time.
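
You can see this with a quick standalone sketch (hypothetical IDs 0 to 49, batch size 5, shuffle off, mirroring the question's setup):

import numpy as np

list_IDs = list(range(50))          # 50 sample IDs
indexes = np.arange(len(list_IDs))  # shuffle=False, so 0..49 in order
batch_size = 5

for index in (0, 7):   # two different batches
    batch_indexes = indexes[index*batch_size:(index+1)*batch_size]
    ids = [list_IDs[k] for k in range(len(batch_indexes))]  # the buggy line
    print(index, ids)  # both batches print [0, 1, 2, 3, 4]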

You are already extracting the indexes of the current batch beforehand, so the line with the error is not needed at all. The following code should work:

    def __getitem__(self, index):
        'Generate one batch of data'
        # Take the IDs of the current batch directly from the index array
        list_IDs_temp = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Generate data
        ([X, Z], y) = self.__data_generation(list_IDs_temp)

        return ([X, Z], y)

See if this code works and you get different results. You should expect bad predictions at first, because during training your model would also only have seen the same few data points.
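
If your list_IDs are not simply 0..N-1 (for example, string IDs used to build the .npy file names), here is a sketch of an equivalent fix that keeps the list_IDs indirection instead of dropping it:

    def __getitem__(self, index):
        'Generate one batch of data'
        # Indexes of the current batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Map batch indexes to their IDs: iterate over the batch's
        # own indexes, not range(len(indexes))
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        ([X, Z], y) = self.__data_generation(list_IDs_temp)

        return ([X, Z], y)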

a-doering

According to this solution, you need to change your steps to the total number of images you want to test on. Try:

# Assuming test_index is a list
prediction_results = model.predict(prediction_generator, steps=len(test_index), verbose=1)
a-doering
  • I've tested it, and I get the same problem. When I change it to steps=1 with batch size 1, I get only one value predicted. – Derriese Jan 26 '21 at 14:06