
I am trying to use an LSTM autoencoder to do sequence-to-sequence learning with variable-length sequences as inputs, using the following code:

from keras.layers import Input, Masking, LSTM, RepeatVector
from keras.models import Model

inputs = Input(shape=(None, input_dim))
masked_input = Masking(mask_value=0.0, input_shape=(None,input_dim))(inputs)
encoded = LSTM(latent_dim)(masked_input)

decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)

where the inputs are raw sequence data padded with 0s to a common length (`timesteps`). Using the code above, the output also has length `timesteps`, but when we calculate the loss function we only want the first Ni elements of the output (where Ni is the length of input sequence i, which may differ from sequence to sequence). Does anyone know a good way to do that?

Thanks!

username123
  • Have you tried to pad the outputs with zeros? – Daniel Möller Oct 09 '17 at 23:32
  • @DanielMöller The length of output is already `timesteps`, would it be even longer if I pad it with zeros? – username123 Oct 10 '17 at 02:07
  • Sorry, pad the "targets" with zeros. – Daniel Möller Oct 10 '17 at 02:28
  • @DanielMöller Yes, that is what I did, and the problem is related to the padding. For instance, if a specific input has 5 elements, when it is fed into the autoencoder, it is padded with 5 zeros to be of length 10. Ideally when calculating the loss, we only need to care about first 5 elements of output, but due to the presence of last 5 elements (unless they are all zeros, which is almost impossible), the loss will be larger. So I wonder if I could "mask out" last 5 elements of the output when calculating the loss? – username123 Oct 10 '17 at 02:33
  • Now I get it... how about another Masking after "RepeatVector"? I'll write an option... – Daniel Möller Oct 10 '17 at 02:44
  • @DanielMöller Sorry I do not understand... How should I mask after "RepeatVector"? – username123 Oct 10 '17 at 02:46

2 Answers


Option 1: you can always train without padding if you accept training in separate batches of equal length.

See this answer for a simple way of separating batches of equal length: Keras misinterprets training data shape
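A minimal sketch of that grouping idea (the `batches_by_length` helper and the `sequences` list of `(length_i, input_dim)` arrays are assumptions for illustration, not code from the linked answer):

import numpy as np

def batches_by_length(sequences):
    #group sequences by their length
    byLength = {}
    for seq in sequences:
        byLength.setdefault(len(seq), []).append(seq)

    #each group becomes one batch of shape (n_sequences, length, input_dim)
    return [np.stack(group) for group in byLength.values()]

#training sketch: one call per equal-length batch
#for batch in batches_by_length(sequences):
#    sequence_autoencoder.train_on_batch(batch, batch)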

In this case, all you have to do is perform the "repeat" operation in another manner, since you don't have the exact length at model-building time.

So, instead of RepeatVector, you can use this:

import keras.backend as K
from keras.layers import Lambda

def repeatFunction(x):
    #x[0] is encoded: (batch, latent_dim)
    #x[1] is inputs: (batch, length, features)

    latent = K.expand_dims(x[0], axis=1)      #shape (batch, 1, latent_dim)
    inpShapeMaker = K.ones_like(x[1][:,:,:1]) #shape (batch, length, 1)

    #broadcasting repeats the latent vector once per timestep
    return latent * inpShapeMaker

#instead of RepeatVector:
decoded = Lambda(repeatFunction, output_shape=(None, latent_dim))([encoded, inputs])
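Putting Option 1 together, a sketch assembled from the question's code (`input_dim` and `latent_dim` are the question's variables):

from keras.layers import Input, Masking, LSTM, Lambda
from keras.models import Model

inputs = Input(shape=(None, input_dim))   #variable-length inputs
masked_input = Masking(mask_value=0.0)(inputs)
encoded = LSTM(latent_dim)(masked_input)

#repeat the latent vector to match the (unknown) input length
decoded = Lambda(repeatFunction, output_shape=(None, latent_dim))([encoded, inputs])
decoded = LSTM(input_dim, return_sequences=True)(decoded)

sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)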

Option 2 (doesn't smell good): use another Masking after RepeatVector.

I tried this, and it works, but we don't get 0s at the end, we get the last value repeated until the end. So you will have to make a weird padding in your target data, repeating the last step until the end.

Example: target [[[1,2],[5,7]]] will have to be [[[1,2],[5,7],[5,7],[5,7]...]]

This may unbalance your data a lot, I think....

def makePadding(x):
    #x[0] is encoded, already repeated
    #x[1] is inputs

    #padding = 1 for actual data in inputs, 0 for padded steps
    #(assuming you don't have 0 for non-padded data)
    padding = K.cast(K.not_equal(x[1][:,:,:1], 0), dtype=K.floatx())

    #repeat the padding mask for each latent dimension
    padding = K.repeat_elements(padding, rep=latent_dim, axis=-1)

    return x[0] * padding

inputs = Input(shape=(timesteps, input_dim))
masked_input = Masking(mask_value=0.0)(inputs)
encoded = LSTM(latent_dim)(masked_input)

decoded = RepeatVector(timesteps)(encoded)
decoded = Lambda(makePadding,output_shape=(timesteps,latent_dim))([decoded,inputs])
decoded = Masking(mask_value=0.0)(decoded)

decoded = LSTM(input_dim, return_sequences=True)(decoded)
sequence_autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
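For the "weird" target padding mentioned above, a minimal sketch (the `pad_repeating_last` helper and `sequences`, a list of `(length_i, input_dim)` arrays, are assumptions for illustration):

import numpy as np

def pad_repeating_last(sequences, timesteps):
    #pad each sequence by repeating its last real step up to `timesteps`
    padded = []
    for seq in sequences:
        tail = np.repeat(seq[-1:], timesteps - len(seq), axis=0)
        padded.append(np.concatenate([seq, tail], axis=0))
    return np.stack(padded)

#targets = pad_repeating_last(sequences, timesteps)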

Option 3 (best): crop the outputs directly based on the inputs; this also eliminates the gradients for the padded steps.

def cropOutputs(x):
    #x[0] is decoded at the end
    #x[1] is inputs
    #both have the same shape

    #padding = 1 for actual data in inputs, 0 for padded steps
    #(if you have zeros in non-padded data, they will lose their backpropagation)
    padding = K.cast(K.not_equal(x[1], 0), dtype=K.floatx())

    return x[0] * padding

....
....

decoded = LSTM(input_dim, return_sequences=True)(decoded)
decoded = Lambda(cropOutputs,output_shape=(timesteps,input_dim))([decoded,inputs])
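Usage is then the usual autoencoder training; since the cropped positions are forced to 0, the zero-padded inputs can also serve as targets (a sketch; the optimizer, loss and `padded_X` array are assumptions):

sequence_autoencoder = Model(inputs, decoded)
sequence_autoencoder.compile(optimizer='adam', loss='mse')
sequence_autoencoder.fit(padded_X, padded_X, epochs=100, batch_size=32)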
Daniel Möller
  • Thank you! I just noticed that you answered my previous question before: https://stackoverflow.com/questions/46494877/how-to-apply-lstm-autoencoder-to-variant-length-time-series-data/. The reason why I do not train separate batches for sequences of different lengths is that the number of data for each batch is typically small, which makes training result not good enough. When I switched to masking approach (as suggested by your previous answer), I got the problem in this question. – username123 Oct 10 '17 at 03:15
  • 1
    But maybe the true best one is combining options 2 and 3 (you spare processing when you have the intermediate mask, and you eliminate nonsense repeated values at the end that would(?) influence your loss function). – Daniel Möller Oct 10 '17 at 03:43
  • 1
    A test that I won't try now is: create a model with masking and see if the repeated outputs participate in backpropagation. – Daniel Möller Oct 10 '17 at 03:47
  • May I ask another question: when implementing these options, you use some backend functions in Keras, I am just curious if you know how these functions are integrated with optimization methods? I guess what I mean is for neural network layers such as Dense/Convolutional/etc, they have backprop functions built-in so optimizer can simply call them, are there similar mechanism for those "lower-level" backend functions? I expect this may be related to implementation of specific backends (e.g. Theano/Tensorflow), but I am not sure. – username123 Oct 10 '17 at 14:35
  • 1
    The backend functions often map 1 to 1 with theano or tensorflow functions. They're here: https://github.com/fchollet/keras/tree/master/keras/backend --- I don't know how the backpropagation works, but I assume Keras leaves it all for tensorflow/theano to do. – Daniel Möller Oct 10 '17 at 14:53
  • 1
    I always assumed that the results of `equal` / `not_equal` are constants. They don't backpropagate, but they don't change the backpropagation of the tensors they modify, unless they're 0, of course. So far, my attempts have been working properly. – Daniel Möller Oct 10 '17 at 15:04
  • I trained the LSTM autoencoder with my sequence data (which describe trajectories in 2D plane), I found the first point of all output data seem to be quite close to each other, but in the input data the trajectories start from different starting points so the output should not start from the same point. Do you have any idea what may be the reason? (do I need to reset the internal states somehow?) – username123 Oct 10 '17 at 15:04
  • I agree, `not_equal` should work as a "mask" in backprop and simply copy the corresponding data backwards. – username123 Oct 10 '17 at 15:07
  • No need to reset states.... maybe it's just a matter of training more. Or the `encoded` tensor is too small for the desired task. – Daniel Möller Oct 10 '17 at 16:20
  • By saying "encoded tensor is too small" do you mean the number of data is too small or the value of `latent_dim` is too small? – username123 Oct 10 '17 at 16:29
  • 1
    I mean latent dim. – Daniel Möller Oct 10 '17 at 16:45
  • You can also create loss functions that give more weight to the first step (see the sketch after these comments). – Daniel Möller Oct 10 '17 at 16:54
  • Thanks! For the current loss function, do longer sequences have more contribution to it than shorter sequences, since there are more terms? – username123 Oct 10 '17 at 18:25
  • Yes. Having more terms helps ignoring very few wrong terms. – Daniel Möller Oct 10 '17 at 18:28
  • OK, that may be a big problem in my current model. The length of my sequences varies significantly since each sequence represents a transition event in my model, some of them are pretty short in nature. I need to properly assign weights to them otherwise short sequences will be completely ignored. What I can think of now is to scale the value of each sequence based on its length, to make sure that they have roughly equal weights, but I do not think it is a good idea to use different scaling factors in input/output space. Do you have any suggestions on this? Thanks! – username123 Oct 10 '17 at 18:37
  • Hi @Daniel, Thanks for this excellent suggestion. I followed this for [ignoring padded/missing timesteps for decoder in AE with multiple features](https://stackoverflow.com/questions/67959601/how-to-correctly-ignore-padded-or-missing-timesteps-at-decoding-time-in-multi-fe), but my losses are NaN now. Am I doing thing right? – A.B Jun 13 '21 at 16:45
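Regarding the loss-weighting idea discussed in the comments above, a minimal sketch of a custom loss that gives extra weight to the first timestep (the function name and the 10.0 factor are made-up assumptions, not from the answer):

import keras.backend as K

def first_step_weighted_mse(y_true, y_pred):
    #per-timestep squared error, shape (batch, timesteps)
    stepwise = K.mean(K.square(y_true - y_pred), axis=-1)

    #give the first timestep 10x the weight of the remaining ones
    return 10.0 * stepwise[:, 0] + K.sum(stepwise[:, 1:], axis=1)

#sequence_autoencoder.compile(optimizer='adam', loss=first_step_weighted_mse)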

For this LSTM autoencoder architecture, which I assume you understand, the mask is lost at the RepeatVector because the LSTM encoder layer has return_sequences=False.

So another option, instead of cropping like above, is to create a custom bottleneck layer that propagates the mask.
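A minimal sketch of such a layer using the Keras 2 custom-layer API (the `MaskedRepeat` name and the two-input design are assumptions, not code from this answer): it repeats the encoded vector along the time axis of the masked input and re-emits that input's mask, so the decoder LSTM and the loss can keep ignoring the padded steps.

from keras.layers import Layer  #in older Keras: from keras.engine.topology import Layer
import keras.backend as K

class MaskedRepeat(Layer):
    def __init__(self, **kwargs):
        super(MaskedRepeat, self).__init__(**kwargs)
        self.supports_masking = True

    def call(self, inputs, mask=None):
        encoded, reference = inputs               #(batch, latent_dim), (batch, time, features)
        latent = K.expand_dims(encoded, axis=1)   #(batch, 1, latent_dim)
        ones = K.ones_like(reference[:, :, :1])   #(batch, time, 1)
        return latent * ones                      #(batch, time, latent_dim)

    def compute_mask(self, inputs, mask=None):
        #propagate the reference sequence's mask, ignoring the encoded vector's
        return None if mask is None else mask[1]

    def compute_output_shape(self, input_shape):
        return (input_shape[1][0], input_shape[1][1], input_shape[0][-1])

#usage sketch:
#decoded = MaskedRepeat()([encoded, masked_input])
#decoded = LSTM(input_dim, return_sequences=True)(decoded)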

Nathan H