
I just started using the CTC loss layer in TensorFlow (r1.0) and I'm a little confused about the "labels" input.

In TensorFlow's API documentation, it says:

labels: An int32 SparseTensor. labels.indices[i, :] == [b, t] means labels.values[i] stores the id for (batch b, time t). labels.values[i] must take on values in [0, num_labels)

  1. Do [b, t] and values[i] mean there is a label values[i] at position t of sequence b in the batch?
  2. It says the values must be in [0, num_labels), but in a sparse tensor almost every entry is 0 except for the specified places, so I don't really know what the sparse tensor for CTC should look like.
  3. For example, if I have a short video of a hand gesture and it has the label "1", should I label the output of all timesteps as "1", or only label the last timestep as "1" and treat the others as "blank"?

thanks!


1 Answer


To address your questions:
1. The notation in the documentation here is a bit misleading: the output label index t need not be the same as the input time slice; it is simply the index into the output sequence. A different letter could have been used, because the input and output sequences are not explicitly aligned. Otherwise, your understanding is correct. I give an example below.

2. Zero is a valid class in your sequence output labels. The so-called blank label in TensorFlow's CTC implementation is the last (largest) class, which should not appear in your ground truth labels anyway. So if you were writing a binary sequence classifier, you'd have three classes: 0 (say "off"), 1 ("on"), and 2 (the "blank" output of CTC); a sketch of how that plays out in code follows this list.

3. CTC loss is for labeling sequence input with sequence output. If you only have a single class label for the whole input sequence, you're probably better off using a softmax cross-entropy loss on the output of the last time step of the RNN cell.
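
As a minimal sketch of point 2 (not part of the original answer; the sizes and placeholder names are hypothetical), this is roughly how the class count and the labels SparseTensor relate when calling tf.nn.ctc_loss in TF r1.0:

import tensorflow as tf

# Hypothetical sizes for the binary "on"/"off" example above.
num_labels = 2                 # real classes: 0 ("off") and 1 ("on")
num_classes = num_labels + 1   # class 2 is the CTC blank, always the last class
batch_size, max_time = 4, 50

# logits would normally be the RNN output projected to num_classes units;
# tf.nn.ctc_loss expects it time-major by default: [max_time, batch_size, num_classes].
logits = tf.placeholder(tf.float32, [max_time, batch_size, num_classes])
seq_len = tf.placeholder(tf.int32, [batch_size])   # true length of each input sequence

# Ground-truth label sequences as a SparseTensor; the values only ever lie
# in [0, num_labels) -- the blank class never appears in the labels.
labels = tf.sparse_placeholder(tf.int32)

loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits, seq_len))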

If you do end up using CTC loss, you can see how I've constructed the training sequence through a reader here: How to generate/read sparse sequence labels for CTC loss within Tensorflow?.

As an example, after I batch two examples that have label sequences [44, 45, 26, 45, 46, 44, 30, 44] and [5, 8, 17, 4, 18, 19, 14, 17, 12], respectively, I get the following result from evaluating the (batched) SparseTensor:

SparseTensorValue(indices=array([[0, 0],
       [0, 1],
       [0, 2],
       [0, 3],
       [0, 4],
       [0, 5],
       [0, 6],
       [0, 7],
       [1, 0],
       [1, 1],
       [1, 2],
       [1, 3],
       [1, 4],
       [1, 5],
       [1, 6],
       [1, 7],
       [1, 8]]), values=array([44, 45, 26, 45, 46, 44, 30, 44,  5,  8, 17,  4, 18, 19, 14, 17, 12], dtype=int32), dense_shape=array([2, 9]))

Notice how, in each index pair, the first column is the batch number and the second column is the position of that label within its sequence; the values themselves are the label classes. The tensor has rank 2, and the size of the last dimension (nine in this case) is the length of the longest label sequence.
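
If it helps, here is a small helper (just a sketch of mine, not the reader from the linked question) that builds such a SparseTensorValue from plain Python lists of label sequences; the result can be fed to a tf.sparse_placeholder(tf.int32):

import numpy as np
import tensorflow as tf

def sequences_to_sparse(sequences):
    # Collect one [batch, position] index pair per label in each sequence.
    indices, values = [], []
    for batch_idx, seq in enumerate(sequences):
        for label_idx, label in enumerate(seq):
            indices.append([batch_idx, label_idx])
            values.append(label)
    # dense_shape is [batch_size, length of the longest label sequence].
    dense_shape = [len(sequences), max(len(seq) for seq in sequences)]
    return tf.SparseTensorValue(
        indices=np.asarray(indices, dtype=np.int64),
        values=np.asarray(values, dtype=np.int32),
        dense_shape=np.asarray(dense_shape, dtype=np.int64))

# The two example label sequences from above.
labels_value = sequences_to_sparse(
    [[44, 45, 26, 45, 46, 44, 30, 44],
     [5, 8, 17, 4, 18, 19, 14, 17, 12]])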

Sorry for the late reply. So t is just the position of the label in the output sequence, no matter how long the corresponding input sequence lasts? – JeremyCC Mar 16 '17 at 15:01
I believe that's right. However, since the model can produce at most one output per input slice, your label sequence shouldn't be longer than the input length. (With label merging, it could also produce _fewer_ outputs than inputs.) – Jerod Mar 17 '17 at 18:31