
I have 950 training video samples and 50 testing video samples. Each video sample has 10 frames, and each frame has shape (n_row=28, n_col=28, n_channels=1). My inputs (x) and outputs (y) have the same shapes.

x_train shape: (950, 10, 28, 28, 1),

y_train shape: (950, 10, 28, 28, 1),

x_test shape: (50, 10, 28, 28, 1),

y_test shape: (50, 10, 28, 28, 1).

I want to feed the input video samples (x) to my model and have it predict the output video samples (y).
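To make the data layout concrete, arrays with these shapes can be mocked up like this (random dummy data just to illustrate the layout, not my real dataset):

import numpy as np

# Dummy arrays with the same layout as my data:
# (samples, frames, rows, cols, channels)
x_train = np.random.rand(950, 10, 28, 28, 1).astype('float32')
y_train = np.random.rand(950, 10, 28, 28, 1).astype('float32')
x_test = np.random.rand(50, 10, 28, 28, 1).astype('float32')
y_test = np.random.rand(50, 10, 28, 28, 1).astype('float32')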

My model so far is:

from keras.layers import Dense, Dropout, Activation, LSTM
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Reshape
from keras.models import Sequential
from keras.layers.wrappers import TimeDistributed

import numpy as np
########################################################################################
model = Sequential()

model.add(TimeDistributed(Convolution2D(16, (3, 3), padding='same'), input_shape=(None, 28, 28, 1))) 
model.add(Activation('sigmoid'))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(Dropout(0.2))

model.add(TimeDistributed(Convolution2D(32, (3, 3), padding='same'))) 
model.add(Activation('sigmoid'))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(Dropout(0.2))

model.add(TimeDistributed(Convolution2D(64, (3, 3), padding='same'))) 
model.add(Activation('sigmoid'))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))

model.add(TimeDistributed(Flatten()))

model.add(LSTM(64, return_sequences=True, stateful=False))
model.add(LSTM(64, return_sequences=True, stateful=False))
model.add(Activation('sigmoid'))
model.add(Dense(784, activation='sigmoid'))
model.add(Reshape((-1, 28,28,1)))

model.compile(loss='mean_squared_error', optimizer='rmsprop')
print(model.summary())

The summary of the model is:

Layer (type)                 Output Shape              Param #   
=================================================================
time_distributed_1 (TimeDist (None, None, 28, 28, 16)  160       
_________________________________________________________________
activation_1 (Activation)    (None, None, 28, 28, 16)  0         
_________________________________________________________________
time_distributed_2 (TimeDist (None, None, 14, 14, 16)  0         
_________________________________________________________________
dropout_1 (Dropout)          (None, None, 14, 14, 16)  0         
_________________________________________________________________
time_distributed_3 (TimeDist (None, None, 14, 14, 32)  4640      
_________________________________________________________________
activation_2 (Activation)    (None, None, 14, 14, 32)  0         
_________________________________________________________________
time_distributed_4 (TimeDist (None, None, 7, 7, 32)    0         
_________________________________________________________________
dropout_2 (Dropout)          (None, None, 7, 7, 32)    0         
_________________________________________________________________
time_distributed_5 (TimeDist (None, None, 7, 7, 64)    18496     
_________________________________________________________________
activation_3 (Activation)    (None, None, 7, 7, 64)    0         
_________________________________________________________________
time_distributed_6 (TimeDist (None, None, 3, 3, 64)    0         
_________________________________________________________________
time_distributed_7 (TimeDist (None, None, 576)         0         
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 64)          164096    
_________________________________________________________________
lstm_2 (LSTM)                (None, None, 64)          33024     
_________________________________________________________________
activation_4 (Activation)    (None, None, 64)          0         
_________________________________________________________________
dense_1 (Dense)              (None, None, 784)         50960     
_________________________________________________________________
reshape_1 (Reshape)          (None, None, 28, 28, 1)   0         
=================================================================
Total params: 271,376
Trainable params: 271,376
Non-trainable params: 0
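
I train this model roughly as follows (a sketch; the batch size and number of epochs are arbitrary, not tuned values):

model.fit(x_train, y_train,
          batch_size=32,
          epochs=50,
          validation_data=(x_test, y_test))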

I know my model has problems, but I don't know how to correct them.

I guess that model.add(Reshape((-1, 28, 28, 1))) may not work properly. To be honest, I didn't know how to deal with the output of model.add(Dense(784, activation='sigmoid')), so I added a Reshape layer to get the output into the right shape. Or maybe the LSTM layers cannot capture the time correlation correctly with my current design.
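
One alternative for the last layers (the change I eventually made in EDIT 2 below) is to let the Dense layer act per timestep and then reshape each timestep's 784-vector back into a frame, roughly like this:

# Reshape each timestep's 784-vector back into a (28, 28, 1) frame,
# instead of reshaping the whole sequence at once.
model.add(Dense(784, activation='sigmoid'))
model.add(TimeDistributed(Reshape((28, 28, 1))))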

EDIT 1: I changed all of the Convolution2D activations from sigmoid to relu. Here is the prediction result of the changed model. As shown, it is still not able to make a reasonable prediction.
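
The prediction images I am comparing are produced roughly like this (a sketch; the sample and frame indices are arbitrary, chosen only for inspection):

import pylab as plt

# Predict on the test videos; pred has shape (50, 10, 28, 28, 1).
pred = model.predict(x_test)

sample, frame = 0, 5  # arbitrary indices, just for visual comparison
plt.subplot(1, 2, 1)
plt.imshow(pred[sample, frame, :, :, 0], cmap='gray')
plt.title('predicted')
plt.subplot(1, 2, 2)
plt.imshow(y_test[sample, frame, :, :, 0], cmap='gray')
plt.title('ground truth')
plt.show()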

EDIT 2: I changed model.add(Reshape((-1, 28, 28, 1))) to model.add(TimeDistributed(Reshape((28, 28, 1)))), increased the LSTM units to 512, and used two LSTM layers. I also added BatchNormalization and changed input_shape to (10, 28, 28, 1). With this input shape, I can build a many-to-many model.

But the predictions didn't change much. I think I'm missing something fundamental. Here is the new model:

from keras.layers import Dense, Dropout, Activation, LSTM
from keras.layers.normalization import BatchNormalization
from keras.layers import Lambda, Convolution2D, MaxPooling2D, Flatten, Reshape, Conv2D
from keras.layers.convolutional import Conv3D
from keras.models import Sequential
from keras.layers.wrappers import TimeDistributed
from keras.layers.pooling import GlobalAveragePooling1D
from keras.optimizers import SGD
from keras.utils import np_utils
from keras.models import Model
import keras.backend as K

import numpy as np

import pylab as plt
model = Sequential()


model.add(TimeDistributed(Convolution2D(16, (3, 3), activation='relu', kernel_initializer='glorot_uniform', padding='same'), input_shape=(10, 28, 28, 1))) 
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(Convolution2D(32, (3,3), activation='relu')))
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(MaxPooling2D((2, 2), strides=(1, 1))))
model.add(Dropout(0.3))

model.add(TimeDistributed(Convolution2D(32, (3,3), activation='relu')))
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(Convolution2D(32, (3,3), activation='relu')))
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(MaxPooling2D((2, 2), strides=(1, 1))))
model.add(Dropout(0.3))

model.add(TimeDistributed(Convolution2D(32, (3,3), activation='relu')))
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(Convolution2D(32, (3,3), activation='relu')))
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(MaxPooling2D((2, 2), strides=(1, 1))))
model.add(Dropout(0.3))

# extract features and dropout 
model.add(TimeDistributed(Flatten()))
model.add(Dropout(0.3))
model.add(Dense(784, activation='linear'))
model.add(TimeDistributed(BatchNormalization()))

# input to LSTM
model.add(LSTM(units=512, activation='tanh', recurrent_activation='hard_sigmoid', kernel_initializer='glorot_uniform', unit_forget_bias=True, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))
model.add(LSTM(units=512, activation='tanh', recurrent_activation='hard_sigmoid', kernel_initializer='glorot_uniform', unit_forget_bias=True, dropout=0.3, recurrent_dropout=0.3, return_sequences=True))

# map each timestep's LSTM features back to 784 pixel values
model.add(Dense(784, activation='linear'))
# model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(Reshape((28,28,1))))
model.compile(loss='mae', optimizer='rmsprop')
print(model.summary())

EDIT 3: Because ConvLSTM2D does exactly what I wanted, and the purpose of writing this question was to understand ConvLSTM2D, I changed the title of the question so that it better reflects my problem.
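
For reference, the kind of ConvLSTM2D model I mean would look roughly like this (a sketch following the standard Keras ConvLSTM example; the filter counts and kernel sizes are guesses, not values I have tuned):

from keras.models import Sequential
from keras.layers.convolutional import Conv3D
from keras.layers.convolutional_recurrent import ConvLSTM2D
from keras.layers.normalization import BatchNormalization

model = Sequential()
# Convolutional recurrence keeps the (28, 28) spatial structure at every timestep.
model.add(ConvLSTM2D(filters=32, kernel_size=(3, 3), padding='same',
                     return_sequences=True, input_shape=(10, 28, 28, 1)))
model.add(BatchNormalization())
model.add(ConvLSTM2D(filters=32, kernel_size=(3, 3), padding='same',
                     return_sequences=True))
model.add(BatchNormalization())
# Map the recurrent features back to one output channel per frame.
model.add(Conv3D(filters=1, kernel_size=(3, 3, 3), padding='same',
                 activation='sigmoid'))
model.compile(loss='mean_squared_error', optimizer='rmsprop')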

  • Why didn't you use a ConvLSTM layer? – Reverse Jul 24 '18 at 12:38
  • Because I want to understand what is behind ConvLSTM2D. Recently, I tried to better understand how exactly LSTM works by reading this post: https://stackoverflow.com/questions/38714959/understanding-keras-lstms?rq=1 . Now I want a concrete understanding of what goes on behind the scenes of ConvLSTM2D. I believe a lot of people like me are confused about how ConvLSTM2D does its magic. – Muser Jul 24 '18 at 12:45

0 Answers