Training a fully convolutional neural network with inputs of variable size takes unreasonably long time in Keras/TensorFlow

Question

I am trying to implement a FCNN for image classification that can accept inputs of variable size. The model is built in Keras with TensorFlow backend.

Consider the following toy example:

model = Sequential()

# width and height are None because we want to process images of variable size 
# nb_channels is either 1 (grayscale) or 3 (rgb)
model.add(Convolution2D(32, 3, 3, input_shape=(nb_channels, None, None), border_mode='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(32, 3, 3, border_mode='same'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Convolution2D(16, 1, 1))
model.add(Activation('relu'))

model.add(Convolution2D(8, 1, 1))
model.add(Activation('relu'))

# reduce the number of dimensions to the number of classes
model.add(Convolution2D(nb_classses, 1, 1))
model.add(Activation('relu'))

# do global pooling to yield one value per class
model.add(GlobalAveragePooling2D())

model.add(Activation('softmax'))

This model runs fine but I am running into a performance issue. Training on images of variable size takes unreasonably long time compared to training on the inputs of fixed size. If I resize all images to the maximum size in the data set it still takes far less time to train the model than training on the variable size input. So is input_shape=(nb_channels, None, None) the right way to specify variable size input? And is there any way to mitigate this performance problem?

Update

model.summary() for a model with 3 classes and grayscale images:

Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
convolution2d_1 (Convolution2D)  (None, 32, None, None 320         convolution2d_input_1[0][0]      
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 32, None, None 0           convolution2d_1[0][0]            
____________________________________________________________________________________________________
maxpooling2d_1 (MaxPooling2D)    (None, 32, None, None 0           activation_1[0][0]               
____________________________________________________________________________________________________
convolution2d_2 (Convolution2D)  (None, 32, None, None 9248        maxpooling2d_1[0][0]             
____________________________________________________________________________________________________
maxpooling2d_2 (MaxPooling2D)    (None, 32, None, None 0           convolution2d_2[0][0]            
____________________________________________________________________________________________________
convolution2d_3 (Convolution2D)  (None, 16, None, None 528         maxpooling2d_2[0][0]             
____________________________________________________________________________________________________
activation_2 (Activation)        (None, 16, None, None 0           convolution2d_3[0][0]            
____________________________________________________________________________________________________
convolution2d_4 (Convolution2D)  (None, 8, None, None) 136         activation_2[0][0]               
____________________________________________________________________________________________________
activation_3 (Activation)        (None, 8, None, None) 0           convolution2d_4[0][0]            
____________________________________________________________________________________________________
convolution2d_5 (Convolution2D)  (None, 3, None, None) 27          activation_3[0][0]               
____________________________________________________________________________________________________
activation_4 (Activation)        (None, 3, None, None) 0           convolution2d_5[0][0]            
____________________________________________________________________________________________________
globalaveragepooling2d_1 (Global (None, 3)             0           activation_4[0][0]               
____________________________________________________________________________________________________
activation_5 (Activation)        (None, 3)             0           globalaveragepooling2d_1[0][0]   
====================================================================================================
Total params: 10,259
Trainable params: 10,259
Non-trainable params: 0

Maybe it's because memory reallocation for tensors, because each new batch have new spartial dimensions? — mrgloom, Apr 25 '17 at 16:07
It's probably to dynamic compilation of model - when adjusting a new computational graph. — Marcin Możejko, Oct 29 '17 at 19:43

score 1 · Answer 1 · answered Sep 06 '19 at 14:07

I think @marcin-możejko may have the right answer in his comment. It may be related to this bug, which was just fixed. And this patch may warn you if things are being compiled too often.

So upgrading to a tf-nightly-gpu-2.0-preview package may fix this. Also do you get this problem with tf.keras.

If I resize all images to the maximum size in the data set it still takes far less time to train the model than training on the variable size input

Note that for basic convolutions with "same" padding, zero padding should have "no" effect on the output, aside from pixel alignment.

So one approach would be to train on a fixed list of sizes and zero pad images to those sizes. For example and train on batches of 128x128, 256x256, 512x512. If you can't fix the dynamic compilation thing this at least would only compile it 3 times. This would be a bit like a 2d "bucket-by-sequence-length" approach sometimes seen with sequence models.

score 0 · Answer 2 · answered Sep 29 '18 at 21:09

Images of different sizes implies images of similar things at a different scale. If this difference in scale is significant the relative position of the similar things will shift from being in the centre of the frame towards the top left as the image size reduces. The (simple) network architecture shown is spatially aware so it would be consistent for the rate of model convergence to degrade as data of a very different scale would be inconsistent. This architecture is not well suited to finding the same thing in different or multiple places.

A certain degree of shearing, rotation, mirroring would help the model generalise, but re-scaled to a consistent size. So, when you re-size you fix the scaling issue and make the input data spatially consistent.

In short, I think it’s that this network architecture is not suited / capable for the task you are giving it i.e. various scales.

Training a fully convolutional neural network with inputs of variable size takes unreasonably long time in Keras/TensorFlow

2 Answers2

Linked