
I'm basically using the code from the Keras InceptionV3 transfer-learning tutorial,

https://faroit.github.io/keras-docs/2.0.0/applications/#inceptionv3

with just a few minor changes to fit my data.

I'm using tensorflow-gpu 1.4 on Windows 7, with Keras 2.0.3 (the latest pip version, I think).

CODE:

from keras.applications.inception_v3 import InceptionV3
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D
from keras import backend as K


img_width, img_height = 299, 299
train_data_dir = r'C:\Users\Moondra\Desktop\Keras Applications\data\train'
nb_train_samples = 8
nb_validation_samples = 100 
batch_size = 10
epochs = 5


train_datagen = ImageDataGenerator(
    rescale=1./255,
    horizontal_flip=True,
    zoom_range=0.1,
    rotation_range=15)



train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode='categorical')


# create the base pre-trained model
base_model = InceptionV3(weights='imagenet', include_top=False)

# add a global spatial average pooling layer
x = base_model.output
x = GlobalAveragePooling2D()(x)
# let's add a fully-connected layer
x = Dense(1024, activation='relu')(x)
# and a logistic layer -- we have 12 classes
predictions = Dense(12, activation='softmax')(x)

# this is the model we will train
model = Model(inputs=base_model.input, outputs=predictions)

# first: train only the top layers (which were randomly initialized)
# i.e. freeze all convolutional InceptionV3 layers
for layer in base_model.layers:
    layer.trainable = False

# compile the model (should be done *after* setting layers to non-trainable)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# train the model on the new data for a few epochs
model.fit_generator(
    train_generator,
    steps_per_epoch=5,
    epochs=epochs)


# at this point, the top layers are well trained and we can start fine-tuning
# convolutional layers from inception V3. We will freeze the bottom N layers
# and train the remaining top layers.

# let's visualize layer names and layer indices to see how many layers
# we should freeze:
for i, layer in enumerate(base_model.layers):
    print(i, layer.name)

# we chose to train the top 2 inception blocks, i.e. we will freeze
# the first 172 layers and unfreeze the rest:
for layer in model.layers[:172]:
    layer.trainable = False
for layer in model.layers[172:]:
    layer.trainable = True

# we need to recompile the model for these modifications to take effect
# we use SGD with a low learning rate
from keras.optimizers import SGD
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9), loss='categorical_crossentropy')

# we train our model again (this time fine-tuning the top 2 inception blocks
# alongside the top Dense layers)
model.fit_generator(
    train_generator,
    steps_per_epoch=5,
    epochs=epochs)

OUTPUT (Can't get past the first epoch):

Epoch 1/5

1/5 [=====>........................] - ETA: 8s - loss: 2.4869
2/5 [===========>..................] - ETA: 3s - loss: 5.5591
3/5 [=================>............] - ETA: 1s - loss: 6.6299
4/5 [=======================>......] - ETA: 0s - loss: 8.4925

It just hangs here.

UPDATE:

I created a virtual env with TensorFlow 1.3 (one version down) and Keras 2.0.3 (latest pip version), and I'm still having the same problem.

UPDATE 2:

I don't think it's a memory issue: if I change the number of steps per epoch, it runs fine all the way to the last step and then just freezes.

So with 30 steps per epoch it will run until step 29; with 5 steps it will run until step 4 and then just hang.

UPDATE 3:

I also tried freezing the first 249 layers, as suggested in the Keras fine-tuning example.
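
For reference, a minimal sketch of that variant, with the freeze boundary at 249 taken from the Keras fine-tuning example (everything else in the script above unchanged):

# freeze the first 249 layers and unfreeze the rest,
# per the boundary suggested in the Keras docs
for layer in model.layers[:249]:
    layer.trainable = False
for layer in model.layers[249:]:
    layer.trainable = True

# recompile so the new trainable flags take effect
model.compile(optimizer=SGD(lr=0.0001, momentum=0.9),
              loss='categorical_crossentropy')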

  • The latest version of Keras is 2.1.1 as far as I can see – gionni Nov 20 '17 at 09:00
  • Your code seems fine to me. It may be a memory overflow problem, please check your memory. Also please check the number of layers in the InceptionV3 network; currently you are freezing 172 layers. – Tushar Gupta Nov 20 '17 at 09:05
  • @TusharGupta I will check the layers on the Inception model, but I'm assuming it's correct, as this code was provided on the official Keras API page. As for the memory leak -- I'm not exactly sure how to check that. I'm using TF as the backend, and TensorFlow is known to allocate all free memory to itself even if it isn't using it, so every time I use a graphics-card monitoring tool, the memory is at 95%. Thank you. – Moondra Nov 20 '17 at 16:44

5 Answers


Apparently it was a bug that got fixed by a Keras update (however, some people are still experiencing the problem).
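
A quick way to check which versions are actually active in your environment, assuming the standalone keras package with a TensorFlow backend (this snippet is illustrative, not from the original answer):

import keras
import tensorflow as tf

# compare against the Keras release notes to see whether you have the fix
print('Keras:', keras.__version__)
print('TensorFlow:', tf.__version__)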

– Moondra

The same problem occurs for me with tensorflow.__version__ == 1.10.1 and keras.__version__ == 2.2.2. The fix for me was to downgrade Keras to 2.2.0 using pip3 install -I keras==2.2.0. Note that this may break compatibility, and you might need to downgrade TensorFlow as well.

– Thomas E

It seems that most of these freezes occur when there is a bug somewhere in the code. In my case, I had built a generator that threw an exception at the end of an epoch, and the process simply stopped. There was no message about the exception, so I also spent some time figuring out what was going on.
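
A minimal sketch of how to surface such silent generator exceptions; debug_generator and my_generator are illustrative names, not part of the Keras API:

import traceback

def debug_generator(gen):
    # Re-yield batches, printing any exception before it gets
    # swallowed by Keras' generator workers.
    try:
        for batch in gen:
            yield batch
    except Exception:
        traceback.print_exc()
        raise

model.fit_generator(debug_generator(my_generator),
                    steps_per_epoch=5,
                    epochs=5)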

– Eugene

As mentioned by @thomas-e, I also had a similar issue with Keras/TensorFlow compatibility. Specifically, my config was: CUDA 10.0, cuDNN 7, tensorflow-gpu 1.14.0, Keras 2.2.5.

I fixed it by downgrading to: CUDA 9.0, cuDNN 7, tensorflow-gpu 1.10.0 and Keras 2.2.0.

I got the idea that it was an incompatibility from this issue: https://github.com/tensorflow/tensorflow/issues/15604

You can also refer to the Keras/TensorFlow compatibility tables here:

  1. https://www.tensorflow.org/install/source#tested_build_configurations
  2. https://docs.floydhub.com/guides/environments/
– Dan

Keeping these relations solved the problem for me:

steps_per_epoch = number of training samples / batch_size

validation_steps = number of validation samples / batch_size

More on the same issue at https://github.com/keras-team/keras/issues/8595
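
With the flow_from_directory generators from the question, these values can be derived from the generator instead of hard-coded; a sketch assuming Keras 2's .samples attribute on the directory iterator (validation_generator is hypothetical here, since the question only builds a training generator):

steps_per_epoch = train_generator.samples // batch_size
validation_steps = validation_generator.samples // batch_size

model.fit_generator(
    train_generator,
    steps_per_epoch=steps_per_epoch,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_steps)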

– Luca Angioloni