I'm working on a speech recognition model in Keras and so far I have this model:
from tensorflow.keras.layers import (Input, Conv1D, BatchNormalization, GRU,
                                     TimeDistributed, Dense, Activation)

input_data = Input(name='the_input', shape=(None, 100), dtype='float32')
x = Conv1D(filters=160, kernel_size=5, strides=2, padding='same', activation='relu')(input_data)
x = BatchNormalization()(x)
x = Conv1D(filters=160, kernel_size=11, strides=1, padding='same', activation='relu')(x)
x = BatchNormalization()(x)
x = GRU(160, return_sequences=True)(x)
x = BatchNormalization()(x)
x = GRU(160, return_sequences=True)(x)
x = BatchNormalization()(x)
x = GRU(160, return_sequences=True)(x)
x = BatchNormalization()(x)
x = TimeDistributed(Dense(alphabet_length + 1))(x)
y_pred = Activation('softmax')(x)
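For context, the CTC head is wired in the usual Keras way, a Lambda layer around K.ctc_batch_cost (shown here only as a rough sketch assuming tf.keras, since the exact wiring isn't the point of the question):

from tensorflow.keras import backend as K
from tensorflow.keras.layers import Lambda
from tensorflow.keras.models import Model

# Extra inputs consumed by K.ctc_batch_cost
labels = Input(name='the_labels', shape=(None,), dtype='float32')
input_length = Input(name='input_length', shape=(1,), dtype='int64')
label_length = Input(name='label_length', shape=(1,), dtype='int64')

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

loss_out = Lambda(ctc_lambda_func, output_shape=(1,), name='ctc')(
    [y_pred, labels, input_length, label_length])

model = Model(inputs=[input_data, labels, input_length, label_length], outputs=loss_out)
# The Lambda output already is the loss, so the compiled loss just passes it through.
model.compile(optimizer='adam', loss={'ctc': lambda y_true, y_pred: y_pred})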
The model was trained on 1 second long word utterances, featurized into mel filterbanks with a feature length of 100. It works pretty well, but only if the input has the same length as the samples it was trained on, which in this case is 1 second of audio (98 timesteps after feature extraction).
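To make the shapes concrete, the featurization is along these lines (illustrative only; librosa and the exact window/hop values here are just one combination that yields 98 frames of 100 mel bands per second at 16 kHz):

import librosa
import numpy as np

def featurize(audio, sr=16000):
    # 25 ms window (n_fft=400), 10 ms hop (hop_length=160), 100 mel bands;
    # with center=False, one second of 16 kHz audio gives 98 frames.
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=400, hop_length=160,
                                         center=False, n_mels=100)
    return np.log(mel + 1e-6).T          # shape: (timesteps, 100)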
What I want to achieve is the ability to feed the model shorter audio chunks, for example 250 ms long, to get better performance on mobile and near real time recognition from the microphone. So far I have been predicting on 1 second long frames of audio with 0.5 seconds of overlap between consecutive frames. This approach has issues, such as not being able to tell whether a word was spoken once or twice in a row, because a single utterance can be recognized twice when it falls into two neighbouring frames.
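Roughly, the chunked inference I'm running now looks like this (a sketch: featurize is the mel-filterbank step above, pred_model is a separate Model(input_data, y_pred) used only for inference, and the greedy K.ctc_decode call is just for illustration):

import numpy as np
from tensorflow.keras import backend as K

def predict_windows(audio, pred_model, featurize, sr=16000, win_s=1.0, hop_s=0.5):
    win, hop = int(win_s * sr), int(hop_s * sr)
    decoded_windows = []
    for start in range(0, max(1, len(audio) - win + 1), hop):
        feats = featurize(audio[start:start + win])         # (98, 100) for a full window
        probs = pred_model.predict(feats[np.newaxis, ...])  # (1, 49, alphabet_length + 1) after the stride-2 conv
        seq_len = np.array([probs.shape[1]])
        decoded, _ = K.ctc_decode(probs, input_length=seq_len, greedy=True)
        decoded_windows.append(K.get_value(decoded[0])[0])  # label indices, padded with -1
    return decoded_windows  # overlapping windows can report the same word twice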
After reading some great explanations here written by Daniel Möller, I came to the conclusion that I'll have to train this model with stateful=True on my GRU layers and split each 1 second long word utterance into 4 consecutive batches, where each batch represents 250 ms of audio. I will also call model.reset_states() in a custom callback after every 4 batches (a minimal sketch of that callback is below, after the generator snippet). The problem is that I'm using CTC as the loss function, and the output from my batch generator looks like this:
outputs = {'ctc': np.zeros([minibatch_size])}
inputs = {'the_input': X_data,
          'the_labels': labels,
          'input_length': input_length,
          'label_length': label_length,
          }
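The reset callback I mentioned above would be something along these lines (assuming tf.keras; older standalone Keras uses on_batch_end instead of on_train_batch_end):

from tensorflow.keras.callbacks import Callback

class ResetStatesEveryNBatches(Callback):
    """Reset recurrent states after every n batches, i.e. after the 4 x 250 ms chunks of one utterance."""
    def __init__(self, n=4):
        super().__init__()
        self.n = n

    def on_train_batch_end(self, batch, logs=None):
        if (batch + 1) % self.n == 0:
            self.model.reset_states()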
So the question is: am I now expected to provide the input_length and label_length of those 250 ms chunks, or of the whole 1 second of audio? I can't see how either option could work, since X_data now has only the number of timesteps contained in 250 ms of audio instead of 1 second, and I can't determine what a 250 ms label would even look like.
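For what it's worth, this is how I currently derive input_length from the number of feature frames (with padding='same' each Conv1D divides the time axis by its stride, rounded up), which is exactly where the 250 ms case confuses me:

import numpy as np

def ctc_input_length(n_frames, conv_strides=(2, 1)):
    # Time steps seen by the CTC loss after the Conv1D stack (padding='same' -> ceil division).
    out = n_frames
    for stride in conv_strides:
        out = int(np.ceil(out / stride))
    return out

print(ctc_input_length(98))       # 49 -> input_length for a full 1 second utterance
print(ctc_input_length(98 // 4))  # 12 -> what a single 250 ms chunk would give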
Any advice, link, or suggestion for a different architecture is welcome.
PS: my reasoning for using 1 second long word utterances as training data (instead of longer sentences) is that I expect these words to come in any random order, they are not supposed to form meaningful sentences, and most of them are short enough to fit in a 1 second window.