
I'm trying to create an image classifier on a dataset of 40,000 images, so that AutoKeras can then train the most appropriate model for me. The problem is that loading all the images and their labels works, but when I run the normalization in Google Colab there is a RAM overflow (despite having a Pro+ account). Here is my code:

# Import TensorFlow
%tensorflow_version 2.x
import tensorflow as tf

import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import normalize, to_categorical
!pip install autokeras
import autokeras as ak

import glob
import os
import cv2
import numpy as np

images = glob.glob(path + '/*.png')

import random

data = []
labels = []

for i in images:
    image=tf.keras.preprocessing.image.load_img(i, color_mode='rgb')
    image=np.array(image, dtype ='float32')
    image=cv2.resize(image, (180, 180))
    image/=255.0
    data.append(image)
    label=os.path.basename(str(i.replace('.png', '')))
    label=label.split()[0]
    labels.append(label)

data = np.array(data)
labels = np.array(labels)
print(labels)

Up to here everything works like a charm, but then I hit the part that causes the overflow:

# normalize feature and encode label
X = data 
y = np.zeros(labels.shape)
indices = np.unique(labels)
for i in range(labels.shape[0]):
  y[i] = np.where(labels[i] == indices)[0]
y = to_categorical(y)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                random_state=42) # this line seems to be the problem where Colab has a RAM overflow

clf= ak.ImageClassifier(overwrite=True, max_trials=20)
clf.fit(X_train, y_train, epochs=10) # and also this line seems to be the problem where Colab has a RAM overflow.

Does anyone know what the problem might be? A hint in any direction would be highly appreciated, as this is driving me crazy!

  • Which normalization do you want (pixels scaled between 0 and 1)? – I'mahdi Jun 20 '22 at 07:26
  • @I'mahdi I updated the code with the normalization, which works now, but the RAM overflow now occurs in the automatic train_test_split() and afterwards in model.fit(). – Jeffrey Sachs Jun 20 '22 at 16:06
  • Why don't you split the data into multiple parts of, say, 5,000 images each and train your model several times? The model can continue training on the second chunk instead of starting from scratch (see the sketch after these comments). – I'mahdi Jun 20 '22 at 16:09
  • Maybe I'm just too inexperienced to do that (new to Keras and TensorFlow) - would you mind showing me how you would do that in code? – Jeffrey Sachs Jun 20 '22 at 16:12
  • I posted an answer, does it help you? – I'mahdi Jun 20 '22 at 18:17
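
For reference, a minimal sketch of the chunked-training idea from the comments above. The chunk size, the small stand-in Keras model, and the load_chunk helper are illustrative assumptions, not part of the original question; images is the list of file paths from the question's code:

import os
import cv2
import numpy as np
import tensorflow as tf

# The class names can be derived from the file names without loading any image data.
all_labels = [os.path.basename(p.replace('.png', '')).split()[0] for p in images]
class_names = np.unique(all_labels)

def load_chunk(paths):
    # Load and preprocess one chunk of images (reuses the loading code from the question).
    data, chunk_labels = [], []
    for p in paths:
        img = tf.keras.preprocessing.image.load_img(p, color_mode='rgb')
        img = cv2.resize(np.array(img, dtype='float32'), (180, 180)) / 255.0
        data.append(img)
        label = os.path.basename(p.replace('.png', '')).split()[0]
        chunk_labels.append(np.where(class_names == label)[0][0])
    X = np.array(data)
    y = tf.keras.utils.to_categorical(chunk_labels, num_classes=len(class_names))
    return X, y

# A small stand-in CNN; any compiled Keras model behaves the same way here.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(180, 180, 3)),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(len(class_names), activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

chunk_size = 5_000  # assumed chunk size, as suggested in the comment
for start in range(0, len(images), chunk_size):
    X_chunk, y_chunk = load_chunk(images[start:start + chunk_size])
    # Repeated fit() calls continue from the current weights,
    # so training does not start over with each chunk.
    model.fit(X_chunk, y_chunk, epochs=1)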

1 Answer


I highly recommend tf.data.Dataset for creating the dataset:

  1. Apply all the preprocessing you want on the images (such as resizing and normalizing) with dataset.map.
  2. Instead of using train_test_split, use dataset.take and dataset.skip to split the dataset.

Code for generating random images and labels:

# !pip install autokeras
import tensorflow as tf
import autokeras as ak
import numpy as np

data = np.random.randint(0, 255, (45_000,32,32,3))
label = np.random.randint(0, 10, 45_000)
label = tf.keras.utils.to_categorical(label) 

Convert data and label to a tf.data.Dataset and process them (only ~55 ms for 45,000 images, benchmarked on Colab):

dataset = tf.data.Dataset.from_tensor_slices((data, label))
def resize_normalize_preprocess(image, label):
    image = tf.image.resize(image, (16, 16))
    image = image / 255.0
    return image, label

# %%timeit 
dataset = dataset.map(resize_normalize_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
# 1 loop, best of 5: 54.9 ms per loop
dataset_size = len(dataset)
train_size = int(0.8 * dataset_size)
test_size = int(0.2 * dataset_size)

dataset = dataset.shuffle(32)
train_dataset = dataset.take(train_size)
test_dataset = dataset.skip(train_size)

print(f'Size dataset : {len(dataset)}')
print(f'Size train_dataset : {len(train_dataset)}')
print(f'Size test_dataset : {len(test_dataset)}')

clf = ak.ImageClassifier(overwrite=True, max_trials=1)
clf.fit(train_dataset, epochs=1)
print(clf.evaluate(test_dataset))

Output:

Size dataset : 45000
Size train_dataset : 36000
Size test_dataset : 9000

Search: Running Trial #1

Value             |Best Value So Far |Hyperparameter
vanilla           |?                 |image_block_1/block_type
True              |?                 |image_block_1/normalize
False             |?                 |image_block_1/augment
3                 |?                 |image_block_1/conv_block_1/kernel_size
1                 |?                 |image_block_1/conv_block_1/num_blocks
2                 |?                 |image_block_1/conv_block_1/num_layers
True              |?                 |image_block_1/conv_block_1/max_pooling
False             |?                 |image_block_1/conv_block_1/separable
0.25              |?                 |image_block_1/conv_block_1/dropout
32                |?                 |image_block_1/conv_block_1/filters_0_0
64                |?                 |image_block_1/conv_block_1/filters_0_1
flatten           |?                 |classification_head_1/spatial_reduction_1/reduction_type
0.5               |?                 |classification_head_1/dropout
adam              |?                 |optimizer
0.001             |?                 |learning_rate

Result of searching and finding the best parameter and training:

Trial 1 Complete [00h 01m 16s]
val_loss: 2.3030436038970947

Best val_loss So Far: 2.3030436038970947
Total elapsed time: 00h 01m 16s
INFO:tensorflow:Oracle triggered exit
1125/1125 [==============================] - 68s 60ms/step - loss: 2.3072 - accuracy: 0.0979
INFO:tensorflow:Assets written to: ./image_classifier/best_model/assets
282/282 [==============================] - 26s 57ms/step - loss: 2.3025 - accuracy: 0.0970
[2.302501916885376, 0.09700000286102295]
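
If building one big NumPy array from all 40,000 PNGs is itself what exhausts the RAM (as the comments below discuss), the same tf.data idea can also be built lazily from the file paths, so each image is decoded on the fly instead of being held in memory. A sketch, assuming images is the list of paths from the question and that the label is the first whitespace-separated token of the file name, as in the question's loading loop:

import os
import numpy as np
import tensorflow as tf
import autokeras as ak

# Labels come from the file names, so no image data is needed to build them.
labels = [os.path.basename(p.replace('.png', '')).split()[0] for p in images]
class_names = sorted(set(labels))
class_to_id = {name: i for i, name in enumerate(class_names)}
label_onehot = tf.keras.utils.to_categorical([class_to_id[l] for l in labels],
                                             num_classes=len(class_names))

# Shuffle the paths once up front so take/skip gives a random train/test split.
rng = np.random.default_rng(42)
perm = rng.permutation(len(images))
paths = np.array(images)[perm]
label_onehot = label_onehot[perm]

def load_and_preprocess(path, label):
    image = tf.io.read_file(path)                      # each PNG is read lazily, per element
    image = tf.image.decode_png(image, channels=3)
    image = tf.image.resize(image, (180, 180)) / 255.0  # resize and normalize on the fly
    return image, label

dataset = tf.data.Dataset.from_tensor_slices((paths, label_onehot))
dataset = dataset.map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)

train_size = int(0.8 * len(paths))
train_dataset = dataset.take(train_size)
test_dataset = dataset.skip(train_size)

clf = ak.ImageClassifier(overwrite=True, max_trials=20)
clf.fit(train_dataset, epochs=10)
print(clf.evaluate(test_dataset))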
  • The problem is that loading the images of the dataset into an array to create the variable consumes too much RAM and overflows. That's why I already do the preprocessing at the beginning. For your understanding, the dataset is from [here](https://www.kaggle.com/datasets/joosthazelzet/lego-brick-images?resource=download-directory), so I can't create it myself with randint... If I understood you wrong I'm truly sorry... – Jeffrey Sachs Jun 21 '22 at 00:19
  • @JeffreySachs, [here](https://stackoverflow.com/questions/72678293/ram-overflow-colab-when-running-model-fit-in-image-classifier-of-autokeras-fo/72691392?noredirect=1#comment128398117_72678293) you said you get the overflow on `train_test_split`, so I wrote an answer that avoids the overflow in `train_test_split`. It's better to state your exact intentions up front; this solution is correct for that. Yes, reading the images, saving them in a `list` and then creating a `numpy.array` gives you an overflow, but you said you got past that conversion with the Pro account on Colab. – I'mahdi Jun 21 '22 at 06:21
  • I only generated the random images to test my answer; you can use this answer with your own images and it will work. If you cannot load all the images, as I said [here](https://stackoverflow.com/questions/72678293/ram-overflow-colab-when-running-model-fit-in-image-classifier-of-autokeras-fo/72691392?noredirect=1#comment128398170_72678293), read the images 5,000 at a time and train the model in multiple rounds, each round continuing from the previous training instead of starting from scratch; that way you can solve your problem. – I'mahdi Jun 21 '22 at 06:23
  • @I'mahdi, thanks for your big effort and sorry for being unclear about my intentions in this question. I have now accepted and upvoted the answer, as it answers the question. – Jeffrey Sachs Jun 21 '22 at 06:51
  • @JeffreySachs, if you want another method instead of writing the images to a `list` and then to a `numpy.array`, you can use [`flow_from_directory`](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_directory); you can read an example of this [here](https://www.kaggle.com/code/vbookshelf/cnn-how-to-use-160-000-images-without-crashing/notebook), and a minimal sketch follows below. – I'mahdi Jun 21 '22 at 07:03
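
For completeness, a minimal sketch of the flow_from_directory approach from the last comment. It assumes the images have already been sorted into one subdirectory per class (the sorted_images/ folder name is only an illustration); batches are then read from disk as needed, so the full dataset never sits in RAM:

import tensorflow as tf

# flow_from_directory expects one subdirectory per class, e.g.
#   sorted_images/class_a/xxx.png, sorted_images/class_b/yyy.png, ...
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255.0,      # normalize pixels to [0, 1] on the fly
    validation_split=0.2,     # built-in train/validation split
)

train_gen = datagen.flow_from_directory(
    'sorted_images/',
    target_size=(180, 180),
    batch_size=32,
    class_mode='categorical',
    subset='training',
)
val_gen = datagen.flow_from_directory(
    'sorted_images/',
    target_size=(180, 180),
    batch_size=32,
    class_mode='categorical',
    subset='validation',
)

# The generators plug directly into a plain Keras model:
# model.fit(train_gen, validation_data=val_gen, epochs=10)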