
I'm trying to do binary classification on labeled data for 300+ videos. The goal is to extract features using a ConvNet and feed them into an LSTM for sequencing, with a binary output after evaluating all the frames in the video. I've preprocessed each video to have exactly 200 frames, with each image being 256 x 256, so that it would be easier to feed into a DNN, and split the dataset into two folders corresponding to the labels (e.g. dog and cat).

However, after searching Stack Overflow for hours, I'm still unsure how to reshape the dataset of video frames so that the model accounts for the number of frames. I've tried feeding the video frames into both a 3D ConvNet and a TimeDistributed(2D ConvNet) + LSTM (e.g. with input shape (300, 200, 256, 256, 3)), with no luck. I can perform 2D ConvNet classification easily (the data is a 4D tensor; I need to add a time-step dimension to make it a 5D tensor), but I'm now having issues wrangling the temporal aspect.

I've been using Keras ImageDataGenerator and train_datagen.flow_from_directory to read in the images, and I've been running into shape-mismatch errors when I attempt to feed them to a TimeDistributed ConvNet. I know that, hypothetically, if I have an X_train dataset I can do X_train = X_train.reshape(...), something like the sketch below. Any example code would be very much appreciated.
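(Placeholder shapes, assuming the frames were loaded in video order:)

# frames currently stacked flat across all videos: (300 * 200, 256, 256, 3)
# regroup so each sample is one whole video: (300, 200, 256, 256, 3)
X_train = X_train.reshape(300, 200, 256, 256, 3)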

starlord

1 Answer


I think you could use ConvLSTM2D in Keras for your purpose. ImageDataGenerator is very good for CNNs on images, but may not be convenient for a CRNN on videos.
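If loading everything into memory is a concern, one alternative (just a sketch, untested; load_video is an assumed helper, sketched further below) is a custom keras.utils.Sequence that yields one batch of whole videos at a time:

import numpy as np
from keras.utils import Sequence

class VideoSequence(Sequence):
    def __init__(self, video_paths, labels, batch_size=4):
        self.video_paths = video_paths  # list of video file paths
        self.labels = labels            # one-hot labels aligned with video_paths
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.video_paths) / self.batch_size))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        # load_video (assumed helper, see below) returns one (200, 256, 256, 3) array
        batch_x = np.array([load_video(p) for p in self.video_paths[start:end]])
        return batch_x, self.labels[start:end]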

You have already transformed your 300 videos into the same shape (200, 256, 256, 3): 200 frames per video, each frame 256x256 RGB. Next, you need to load them into a numpy array of shape (300, 200, 256, 256, 3). For reading videos into numpy arrays, see this answer.
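For example, a minimal loader sketch with OpenCV (my assumption: each preprocessed video is stored as a file that cv2.VideoCapture can read and already has exactly 200 frames; the resize is just a safeguard):

import cv2
import numpy as np

def load_video(path):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        # OpenCV reads BGR; convert to RGB and make sure the frame is 256x256
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, (256, 256)))
    cap.release()
    return np.array(frames)  # (200, 256, 256, 3)

# video_paths is an assumed list of your 300 video files
video_data = np.array([load_video(p) for p in video_paths])  # (300, 200, 256, 256, 3)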

Then you can feed the data into a CRNN. Its first ConvLSTM2D layer should have input_shape = (200, 256, 256, 3); the batch dimension is not included in input_shape.

A sample based on your data (illustrative only, not tested):

from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers.convolutional_recurrent import ConvLSTM2D

model = Sequential()
# input_shape excludes the batch dimension: (frames, height, width, channels)
model.add(ConvLSTM2D(filters = 32, kernel_size = (5, 5), input_shape = (200, 256, 256, 3)))
### model.add(...more layers)
# ConvLSTM2D (with return_sequences=False, the default) outputs a 4D tensor
# (batch, height, width, filters), so flatten it before the final Dense classifier
model.add(Flatten())
model.add(Dense(units = num_of_categories, # number of your video categories
                kernel_initializer = 'Orthogonal', activation = 'softmax'))
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

# then train it
model.fit(video_data, # shape (300, 200, 256, 256, 3)
          [list of categories], # one-hot labels, shape (300, num_of_categories)
          batch_size = 20,
          epochs = 50,
          validation_split = 0.1)
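
For the second argument of fit, one way to build the labels (a sketch; class_indices is an assumed array with one integer label per video) is keras.utils.to_categorical:

from keras.utils import to_categorical

# class_indices: one integer per video (e.g. 0 = dog, 1 = cat),
# in the same order as video_data (an assumed variable)
labels = to_categorical(class_indices, num_classes=num_of_categories)  # (300, num_of_categories)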

I hope this helps.

GoFindTruth
  • Hi, I am getting the error: ValueError: Input 0 is incompatible with layer conv_lst_m2d_12: expected ndim=5, found ndim=6 – Saikat Oct 03 '19 at 18:12
  • @Saikat it seems your input data have been transformed into something with 6 dimensions. Check whether the input data are being prepared incorrectly. – GoFindTruth Oct 04 '19 at 08:45
  • 1
    the batch dimension should not be included in the input_shape. So instead of input_shape=(None, 200, 256, 256, 3), it should be (200, 256, 256, 3). @Saikat that's why you are receiving this ValueError .. – Yousif H Nov 25 '19 at 13:49