
I'm trying to use TensorFlow's CTC implementation from the contrib package (tf.contrib.ctc.ctc_loss), without success.

  • First of all, does anyone know where I can read a good step-by-step tutorial? TensorFlow's documentation is very poor on this topic.
  • Do I have to provide the labels to ctc_loss with the blank label interleaved, or not?
  • I was not able to overfit my network even when using a training dataset of length 1 over 200 epochs. :(
  • How can I calculate the label error rate using tf.edit_distance? (My current guess is sketched after the code below.)

Here is my code:

with graph.as_default():

  max_length = X_train.shape[1]
  frame_size = X_train.shape[2]
  max_target_length = y_train.shape[1]

  # Batch size x time steps x data width
  data = tf.placeholder(tf.float32, [None, max_length, frame_size])
  data_length = tf.placeholder(tf.int32, [None])

  #  Batch size x max_target_length
  target_dense = tf.placeholder(tf.int32, [None, max_target_length])
  target_length = tf.placeholder(tf.int32, [None])

  #  Generating sparse tensor representation of target
  target = ctc_label_dense_to_sparse(target_dense, target_length)

  # Applying LSTM, returning output for each timestep (y_rnn1, 
  # [batch_size, max_time, cell.output_size]) and the final state of shape
  # [batch_size, cell.state_size]
  y_rnn1, h_rnn1 = tf.nn.dynamic_rnn(
    tf.nn.rnn_cell.LSTMCell(num_hidden, state_is_tuple=True, num_proj=num_classes), #  num_proj=num_classes
    data,
    dtype=tf.float32,
    sequence_length=data_length,
  )

  #  For sequence labelling, we want a prediction for each timestamp. 
  #  However, we share the weights for the softmax layer across all timesteps. 
  #  How do we do that? By flattening the first two dimensions of the output tensor. 
  #  This way time steps look the same as examples in the batch to the weight matrix. 
  #  Afterwards, we reshape back to the desired shape
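  #  (If I understand num_proj correctly, the flattening described above is not
  #  needed here: num_proj=num_classes in the LSTMCell already applies a shared
  #  linear projection of the cell output at every timestep.)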


  # Transposing to time-major shape [max_length, batch_size, num_classes],
  # as expected by ctc_loss
  logits = tf.transpose(y_rnn1, perm=(1, 0, 2))

  #  Get the loss by calculating ctc_loss. This op also calculates the
  #  gradient. It performs the softmax operation for you, so the inputs
  #  should be e.g. linear projections of outputs by an LSTM.
  loss = tf.reduce_mean(tf.contrib.ctc.ctc_loss(logits, target, data_length))

  #  Define our optimizer with learning rate
  optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)

  #  Decoding using beam search
  decoded, log_probabilities = tf.contrib.ctc.ctc_beam_search_decoder(logits, data_length, beam_width=10, top_paths=1)
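
For the last bullet above, my current guess is that the label error rate would be the mean normalized edit distance between the best decoded path and the sparse target (a sketch I have not verified; decoded and target come from the graph above):

    #  Label error rate: mean normalized edit distance between the best
    #  decoded path and the ground-truth sparse target
    ler = tf.reduce_mean(tf.edit_distance(tf.cast(decoded[0], tf.int32),
                                          target, normalize=True))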

Thanks!

Update (06/29/2016)

Thank you, @jihyeon-seo! So, at the input of the RNN we have something like [num_batch, max_time_step, num_features]. We use dynamic_rnn to perform the recurrent calculations given the input, outputting a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need to do an affine projection in each timestep with weight sharing, so we have to reshape to [num_batch * max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], add a bias, undo the reshape, and transpose (so we will have [max_time_steps, num_batch, num_classes] for the ctc_loss input), and this result will be the input of the ctc_loss function. Did I do everything correctly?

This is the code:

    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)

    h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)

    #  Reshaping to share weights across timesteps
    x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])

    self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1

    #  Reshaping
    self._logits = tf.reshape(self._logits, [max_length, -1, num_classes])

    #  Calculating loss
    loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)

    self.cost = tf.reduce_mean(loss)

Update (07/11/2016)

Thank you @Xiv. Here is the code after the bug fix:

    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)

    h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)

    #  Reshaping to share weights across timesteps
    x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])

    self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1

    #  Reshaping back to [batch, max_time, num_classes], then transposing
    #  to time-major for ctc_loss
    self._logits = tf.reshape(self._logits, [-1, max_length, num_classes])
    self._logits = tf.transpose(self._logits, (1, 0, 2))

    #  Calculating loss
    loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)

    self.cost = tf.reduce_mean(loss)

Update (07/25/2016)

I published part of my code on GitHub, working with one utterance. Feel free to use it! :)

  • There is an error in your code after you reshape after the RNN. If the tensor were time-major, your reshape would be correct, but then the RNN would need time_major=True passed in. Since it is batch-major, you need tf.transpose(tf.reshape(h_rnn1, [-1, max_length, num_classes]), [1, 0, 2]). – Xiv Jul 07 '16 at 13:29

2 Answers


I'm trying to do the same thing. Here's what I found that may interest you.

It was really hard to find a tutorial for CTC, but this example was helpful.

As for the blank label, the CTC layer assumes that the blank index is num_classes - 1, so you need to provide an additional class for the blank label.
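
For example (a sketch; the label count here is hypothetical):

    num_labels = 61                # e.g. a phoneme inventory (hypothetical count)
    num_classes = num_labels + 1   # one extra class for the blank
    blank_index = num_classes - 1  # CTC reserves the last index for the blank,
                                   # so never use it in your target sequences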

Also, the CTC loss performs the softmax internally. In your code, the RNN layer is connected directly to the CTC loss layer. The output of the RNN layer is internally activated, so you need to add one more hidden layer (it could be the output layer) without an activation function, and then add the CTC loss layer.
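
For instance, a minimal sketch of that extra linear layer (the variable names are illustrative, not from the question's code):

    # rnn_outputs: [batch, max_time, num_hidden], already activated inside the LSTM
    W = tf.Variable(tf.truncated_normal([num_hidden, num_classes], stddev=0.1))
    b = tf.Variable(tf.zeros([num_classes]))
    flat = tf.reshape(rnn_outputs, [-1, num_hidden])  # [batch * max_time, num_hidden]
    logits = tf.matmul(flat, W) + b                   # linear only: no softmax here,
                                                      # because ctc_loss applies it itself
    logits = tf.transpose(tf.reshape(logits, [-1, max_time, num_classes]), [1, 0, 2])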

J. Seo
  • Thank you @jihyeon-seo. Have you had any problems training your network with CTC loss? It has been very hard to overfit this network; in many papers the authors say that an LSTM network overfits easily, but I couldn't overfit my network of 1 LSTM layer with 320 memory cells using only 1 utterance (TIMIT corpus, with filter bank features), even after 2000 epochs. :( – Igor Macedo Quintanilha Jun 29 '16 at 23:56
  • After only 100 epochs, I got an overfitted LSTM model for one sentence. – J. Seo Jul 05 '16 at 02:05
  • I think you can check the input and output tensors between the LSTM layer and the CTC loss layer. Did you check that the CTC layer returns a loss which is updated at every epoch? – J. Seo Jul 05 '16 at 02:13
  • Yes, I checked. When I use the filter bank outputs as my features, I cannot train the network. But when I switched to MFCC features, everything worked smoothly. :) – Igor Macedo Quintanilha Jul 25 '16 at 03:29
  • @JihyeonSeo Could you explain what `RNN layer is internally activated` means in more detail? I am trying to understand why one additional affine transformation layer (without an activation function) is needed. – Helin Wang Aug 09 '16 at 14:39

See here for an example with bidirectional LSTM, CTC, and edit distance implementations, training a phoneme recognition model on the TIMIT corpus. If you train on that corpus's training set, you should be able to get phoneme error rates down to 20-25% after 120 epochs or so.

Jon Rein