I am trying to understand how CTC loss works for speech recognition and how it can be implemented in Keras.
- What I think I understood (please correct me if I'm wrong!)
Roughly speaking, the CTC loss is added on top of a classical network in order to decode sequential information element by element (letter by letter for text or speech) rather than decoding a block of elements directly (a whole word, for example).
Let's say we're feeding utterances of some sentences as MFCCs.
The goal of using the CTC loss is to learn how each letter aligns with the MFCC frames at each time step. Thus, the Dense+softmax output layer is composed of as many neurons as there are elements needed to compose the sentences:
- alphabet (a, b, ..., z)
- a blank token (-)
- a space (_) and an end-character (>)
So the softmax layer has 29 neurons (26 for the alphabet + 3 special characters).
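To fix my mental picture, here is roughly how I imagine the output classes (the ordering, and whether K.ctc_batch_cost expects the blank at a particular index, are assumptions on my part):

alphabet = list("abcdefghijklmnopqrstuvwxyz")   # 26 letters
special = ["-", "_", ">"]                       # blank, space, end-character
chars = alphabet + special

ALPHABET_LENGTH = len(chars)                    # 29 -> size of the softmax layer
char_to_index = {c: i for i, c in enumerate(chars)}

# The way I understand decoding: at each time step the network emits one of the
# 29 classes; repeated characters are then collapsed and blanks removed, e.g.
#   "hh-eee-l-ll-oo--"  ->  "hello"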
To implement it, I found that I can do something like this:
# CTC implementation from the Keras example found at
# https://github.com/keras-team/keras/blob/master/examples/image_ocr.py
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Lambda, Dense, LSTM, Bidirectional, TimeDistributed

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

# let's say each MFCC sequence is (1000 time steps x 20 features)
input_data = Input(shape=(1000, 20))
x = Bidirectional(LSTM(..., return_sequences=True))(input_data)
x = Bidirectional(LSTM(..., return_sequences=True))(x)
y_pred = TimeDistributed(Dense(units=ALPHABET_LENGTH, activation='softmax'))(x)

# I assume these have to be extra Input layers, as in the image_ocr example
# (not sure this is right); MAX_STRING_LENGTH is just a placeholder of mine
y_true = Input(shape=(MAX_STRING_LENGTH,))
input_length = Input(shape=(1,))
label_length = Input(shape=(1,))

loss_out = Lambda(function=ctc_lambda_func, name='ctc', output_shape=(1,))(
    [y_pred, y_true, input_length, label_length])

model = Model(inputs=[input_data, y_true, input_length, label_length],
              outputs=loss_out)
With ALPHABET_LENGTH = 29 (alphabet length + special characters)
And:
- y_true: tensor (samples, max_string_length) containing the truth labels.
- y_pred: tensor (samples, time_steps, num_categories) containing the prediction, or output of the softmax.
- input_length: tensor (samples, 1) containing the sequence length for each batch item in y_pred.
- label_length: tensor (samples, 1) containing the sequence length for each batch item in y_true.
(the descriptions above are taken from the Keras documentation for ctc_batch_cost)
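If it matters, the image_ocr example I copied from compiles such a model with a dummy Keras loss that simply passes the Lambda output through, since the real CTC loss is already computed inside the Lambda. I assume the same pattern applies here (the optimizer below is an arbitrary placeholder of mine):

# the CTC loss is computed inside the Lambda layer, so the Keras loss
# function only needs to forward that value unchanged
model.compile(optimizer='adam',
              loss={'ctc': lambda y_true, y_pred: y_pred})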
Now, I'm facing some problems:
- What I don't understand
- Is this implementation the right way to code and use the CTC loss?
- I do not understand what y_true, input_length and label_length concretely are. Any examples? (My current guess is sketched after these questions.)
- In what form should I give the labels to the network? Again, any examples?
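To make the last two questions concrete, here is how I currently imagine the training data would be fed; everything in this sketch (the padding value, the lengths, the dummy targets) is a guess on my part and may be exactly where I'm going wrong:

import numpy as np

batch_size = 4
MAX_STRING_LENGTH = 50        # my guess: labels padded to a fixed maximum length

# dummy MFCC batch: (samples, time_steps, features)
X = np.random.rand(batch_size, 1000, 20)

# dummy transcriptions, already mapped to integer indices and padded
# (I am not sure which value should be used for the padding)
labels = np.zeros((batch_size, MAX_STRING_LENGTH))

# number of time steps fed to the CTC for each sample
# (1000 - 2 because of the y_pred[:, 2:, :] slicing? not sure)
input_lengths = np.full((batch_size, 1), 1000 - 2)

# true length of each transcription, before padding
label_lengths = np.full((batch_size, 1), 10)

# the model's output is the loss itself, so the target for the dummy
# Keras loss can be anything with the right shape
dummy_targets = np.zeros((batch_size, 1))

model.fit([X, labels, input_lengths, label_lengths], dummy_targets, batch_size=batch_size)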