
I am aware that there is a similar topic at LSTM Followed by Mean Pooling, but that is about Keras and I work in pure TensorFlow.

I have an LSTM network where the recurrence is handled by:

outputs, final_state = tf.nn.dynamic_rnn(cell,
                                         embed,
                                         sequence_length=seq_lengths,
                                         initial_state=initial_state)

where I pass the correct sequence length for each sample (the inputs are zero-padded). Either way, outputs contains irrelevant entries: since all samples are padded to the same length, the outputs beyond each sample's true sequence length are meaningless.

Right now I'm extracting the last relevant output by means of the following method:

def extract_axis_1(data, ind):
    """
    Get specified elements along the first axis of tensor.
    :param data: TensorFlow tensor that will be subsetted.
    :param ind: Indices to take (one for each element along axis 0 of data).
    :return: Subsetted tensor.
    """

    # Pair each batch index with its requested time index: [[0, ind[0]], [1, ind[1]], ...]
    batch_range = tf.range(tf.shape(data)[0])
    indices = tf.stack([batch_range, ind], axis=1)
    # Gather one output per sample; reduce_mean then averages the gathered outputs over axis 0.
    res = tf.reduce_mean(tf.gather_nd(data, indices), axis=0)
    return res

where I pass sequence_length - 1 as the indices. As in the topic linked above, I would like to select all relevant outputs and apply average pooling over them, instead of taking just the last one.

Now, I tried passing nested lists as indices to extract_axis_1, but tf.stack does not accept this.

Any solution directions for this?

riccardo_92
  • What do you mean by "relevant output"? Usually, you also train the network to predict a "STOP" symbol: your real output is what lies between the "GO" symbol and the "STOP" symbol. What are you going to do after the "relevant output" filtering? – Giuseppe Marra Sep 04 '17 at 07:46
  • I mean that there could be 100 outputs (the number of unrolled cells), but the input sequence was only of size 10. I want the outputs corresponding to those 10 inputs / cells. After obtaining those, I want to average them and then predict a binary class (with a simple fully connected layer). Right now I am trying that with only the last relevant output, but that is proving to be hard. – riccardo_92 Sep 04 '17 at 07:53

1 Answer


You can exploit the weights parameter of the tf.contrib.seq2seq.sequence_loss function.

From the documentation:

weights: A Tensor of shape [batch_size, sequence_length] and dtype float. weights constitutes the weighting of each prediction in the sequence. When using weights as masking, set all valid timesteps to 1 and all padded timesteps to 0, e.g. a mask returned by tf.sequence_mask.

You need to compute a binary mask that distinguishes your valid outputs from the invalid ones. Then you can just provide this mask to the weights parameter of the loss function (probably you will want to use a loss like this one); the function will not consider outputs with a 0 weight in the computation of the loss.
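
As a rough sketch (TensorFlow 1.x), assuming you have logits of shape [batch_size, max_time, num_classes] and integer targets of shape [batch_size, max_time] (both, as well as max_time, are placeholders for whatever your graph already produces), the mask can come straight from tf.sequence_mask:

# 1.0 on valid timesteps, 0.0 on padded ones
valid_mask = tf.sequence_mask(seq_lengths, maxlen=max_time, dtype=tf.float32)

loss = tf.contrib.seq2seq.sequence_loss(logits=logits,
                                        targets=targets,
                                        weights=valid_mask)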

If you can't or don't need to use a sequence loss, you can do exactly the same thing manually: compute a binary mask, multiply your outputs by this mask, and feed the result to your fully connected layer.
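
A possible sketch of this manual variant, combining the masking described above with the average pooling discussed in the comments below; here outputs is the [batch_size, max_time, hidden_size] tensor returned by tf.nn.dynamic_rnn, and the final dense layer is only an illustrative placeholder:

mask = tf.sequence_mask(seq_lengths, maxlen=tf.shape(outputs)[1], dtype=tf.float32)
mask = tf.expand_dims(mask, axis=-1)             # [batch_size, max_time, 1]

masked_outputs = outputs * mask                  # zero out the padded timesteps
pooled = tf.reduce_sum(masked_outputs, axis=1) / tf.reduce_sum(mask, axis=1)  # [batch_size, hidden_size]

logits = tf.layers.dense(pooled, units=2)        # e.g. a small classifier on top of the pooled outputs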

Giuseppe Marra
  • Right now I am only using the last relevant output. Since the loss is computed using only that relevant output, I don't really need to mask my output. Am I wrong? – riccardo_92 Sep 04 '17 at 08:02
  • But you said you would like to try all the relevant outputs, not only the last one. – Giuseppe Marra Sep 04 '17 at 08:03
  • Exactly, but since I want to average them (over the time-axis), I would have to do the masking before computing the loss anyway. – riccardo_92 Sep 04 '17 at 08:05
  • Weighting is necessary anyway. You should do something like: `avg(output*mask) / sum(mask)`. However, think about using ALL your outputs and not only their mean. This is quite different: averaging makes you lose the time information of the predictions. – Giuseppe Marra Sep 04 '17 at 08:07
  • I think you make a good point. So `mask` would basically be a Tensor containing 1's and 0's (and then * would mean `tf.matmul()`)? – riccardo_92 Sep 04 '17 at 08:09
  • Yes, mask is a 0/1 mask (0 on invalid inputs, 1 on valid inputs). The * is the element-wise product (you need to keep exactly the same dimensions; you are simply setting to zero the outputs corresponding to invalid, i.e. padded, inputs). – Giuseppe Marra Sep 04 '17 at 08:12
  • This is indeed a feasible solution in the case of a single batch. Let me demonstrate: `a = tf.constant([[[1,2,3], [4,5,6]], [[7,8,9], [10, 11, 12]]]) mask = np.array( [[[1, 0, 0], [0, 1, 1]], [[1, 0, 0], [0, 1, 1]]] ) with tf.Session() as sess: masked = tf.multiply(a, mask) print(masked.eval())` which will return `[[[ 1 0 0] [ 0 5 6]] [[ 7 0 0] [ 0 11 12]]]` as expected. However, the first dimension represents the mini-batches, and I would still have to average over time (second dimension), but per mini-batch. This leaves me with the same problem. – riccardo_92 Sep 04 '17 at 09:19
  • It is feasible in any case: `tf.reduce_sum(tf.multiply(output, mask), axis=1) / tf.reduce_sum(mask, axis=1)`. tf.multiply supports broadcasting: it will take care of tiling over the input dimension. – Giuseppe Marra Sep 04 '17 at 12:03
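
For reference, a small self-contained example of the expression from the last comment, with toy values (a batch of 2 samples, max_time of 2, hidden size of 3):

import tensorflow as tf

outputs = tf.constant([[[1., 2., 3.], [4., 5., 6.]],
                       [[7., 8., 9.], [0., 0., 0.]]])      # second sample is padded
seq_lengths = tf.constant([2, 1])

# [2, 2, 1] mask: 1.0 on valid timesteps, 0.0 on padded ones
mask = tf.expand_dims(tf.sequence_mask(seq_lengths, maxlen=2, dtype=tf.float32), -1)

# sum of the valid outputs over time, divided by the number of valid timesteps per sample
avg = tf.reduce_sum(tf.multiply(outputs, mask), axis=1) / tf.reduce_sum(mask, axis=1)

with tf.Session() as sess:
    print(sess.run(avg))   # [[2.5 3.5 4.5] [7. 8. 9.]]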