I am trying to understand how CTC loss works for speech recognition and how it can be implemented in Keras.
- What I think I understood (please correct me if I'm wrong!)
Roughly speaking, the CTC loss is added on top of a classical network in order to decode sequential information element by element (letter by letter for text or speech) rather than decoding a block of elements directly (a whole word, for example).
Let's say we're feeding utterances of some sentences as MFCCs.
The goal of using the CTC loss is to learn how each letter aligns with the MFCC frames at each time step. Thus, the Dense+softmax output layer is composed of as many neurons as there are elements needed to compose the sentences:
- alphabet (a, b, ..., z)
- a blank token (-)
- a space (_) and an end-character (>)
So the softmax layer has 29 neurons (26 for the alphabet + 3 special characters).
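To fix my mental picture, here is roughly how I imagine the output classes (the ordering, and whether K.ctc_batch_cost expects the blank at a particular index, are assumptions on my part):

alphabet = list("abcdefghijklmnopqrstuvwxyz")   # 26 letters
special = ["-", "_", ">"]                       # blank, space, end-character
chars = alphabet + special

ALPHABET_LENGTH = len(chars)                    # 29 -> size of the softmax layer
char_to_index = {c: i for i, c in enumerate(chars)}

# The way I understand decoding: at each time step the network emits one of the
# 29 classes; repeated characters are then collapsed and blanks removed, e.g.
#   "hh-eee-l-ll-oo--"  ->  "hello"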
To implement it, I found that I can do something like this:
# CTC implementation from the Keras example found at
# https://github.com/keras-team/keras/blob/master/examples/image_ocr.py
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Lambda, Dense, LSTM, Bidirectional, TimeDistributed

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    # the 2 is critical here since the first couple outputs of the RNN
    # tend to be garbage:
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

# let's say each MFCC sequence is (1000 time steps x 20 features)
input_data = Input(shape=(1000, 20))
x = Bidirectional(LSTM(..., return_sequences=True))(input_data)
x = Bidirectional(LSTM(..., return_sequences=True))(x)
y_pred = TimeDistributed(Dense(units=ALPHABET_LENGTH, activation='softmax'))(x)

# I assume these have to be extra Input layers, as in the image_ocr example
# (not sure this is right); MAX_STRING_LENGTH is just a placeholder of mine
y_true = Input(shape=(MAX_STRING_LENGTH,))
input_length = Input(shape=(1,))
label_length = Input(shape=(1,))

loss_out = Lambda(function=ctc_lambda_func, name='ctc', output_shape=(1,))(
    [y_pred, y_true, input_length, label_length])

model = Model(inputs=[input_data, y_true, input_length, label_length],
              outputs=loss_out)
With ALPHABET_LENGTH = 29 (alphabet length + special characters)
And:
- y_true: tensor (samples, max_string_length) containing the truth labels.
- y_pred: tensor (samples, time_steps, num_categories) containing the prediction, or output of the softmax.
- input_length: tensor (samples, 1) containing the sequence length for each batch item in y_pred.
- label_length: tensor (samples, 1) containing the sequence length for each batch item in y_true.
(the descriptions above are taken from the Keras documentation for ctc_batch_cost)
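If it matters, the image_ocr example I copied from compiles such a model with a dummy Keras loss that simply passes the Lambda output through, since the real CTC loss is already computed inside the Lambda. I assume the same pattern applies here (the optimizer below is an arbitrary placeholder of mine):

# the CTC loss is computed inside the Lambda layer, so the Keras loss
# function only needs to forward that value unchanged
model.compile(optimizer='adam',
              loss={'ctc': lambda y_true, y_pred: y_pred})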
Now, I'm facing some problems:
- What I don't understand
- Is this implementation the right way to code and use the CTC loss?
- I do not understand what y_true, input_length and label_length concretely are. Any examples? (My current guess is sketched after these questions.)
- In what form should I give the labels to the network? Again, any examples?
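To make the last two questions concrete, here is how I currently imagine the training data would be fed; everything in this sketch (the padding value, the lengths, the dummy targets) is a guess on my part and may be exactly where I'm going wrong:

import numpy as np

batch_size = 4
MAX_STRING_LENGTH = 50        # my guess: labels padded to a fixed maximum length

# dummy MFCC batch: (samples, time_steps, features)
X = np.random.rand(batch_size, 1000, 20)

# dummy transcriptions, already mapped to integer indices and padded
# (I am not sure which value should be used for the padding)
labels = np.zeros((batch_size, MAX_STRING_LENGTH))

# number of time steps fed to the CTC for each sample
# (1000 - 2 because of the y_pred[:, 2:, :] slicing? not sure)
input_lengths = np.full((batch_size, 1), 1000 - 2)

# true length of each transcription, before padding
label_lengths = np.full((batch_size, 1), 10)

# the model's output is the loss itself, so the target for the dummy
# Keras loss can be anything with the right shape
dummy_targets = np.zeros((batch_size, 1))

model.fit([X, labels, input_lengths, label_lengths], dummy_targets, batch_size=batch_size)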