
I have sequence data that tells me what color was observed for multiple subjects at different points in time. For example:

ID  Time  Color
A   1     Blue
A   2     Red
A   5     Red
B   3     Blue
B   6     Green
C   1     Red
C   3     Orange

I want to obtain predictions for the most likely color at each of the next 3 time steps, as well as the probability of that color appearing. For example, for ID A, I'd like to know the next 3 items (time, color) in the sequence, along with the probability of each predicted color appearing.

I understand that LSTMs are often used to predict this type of sequential data, and that I would feed in a 3d array like

input = [
    [[1, 1], [2, 2], [5, 2]],  # blue at t=1, red at t=2, red at t=5 for ID A
    [[0, 0], [3, 1], [6, 3]],  # nothing for first entry, blue at t=3, green at t=6 for ID B
    [[0, 0], [1, 2], [3, 4]],  # nothing for first entry, red at t=1, orange at t=3 for ID C
]
  

after mapping the colors to numbers (Blue -> 1, Red -> 2, Green -> 3, Orange -> 4, etc.). My understanding is that, by default, the LSTM just predicts the next item in each sequence, so for example

output = [
    [[7, 2]],  # next item is most likely red at t=7
    [[9, 3]],  # next item is most likely green at t=9
    [[6, 2]],  # next item is most likely red at t=6
]
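
For reference, a minimal sketch of how such a pre-padded 3D array might be built from the table above (the pandas usage and the integer mapping here are just illustrative assumptions):

import numpy as np
import pandas as pd

# Hypothetical reconstruction of the table above
df = pd.DataFrame({
    "ID":    ["A", "A", "A", "B", "B", "C", "C"],
    "Time":  [1, 2, 5, 3, 6, 1, 3],
    "Color": ["Blue", "Red", "Red", "Blue", "Green", "Red", "Orange"],
})

# Assumed mapping: Blue -> 1, Red -> 2, Green -> 3, Orange -> 4 (0 reserved for padding)
color_to_int = {"Blue": 1, "Red": 2, "Green": 3, "Orange": 4}
df["Color"] = df["Color"].map(color_to_int)

# One (time, color) sequence per ID, left-padded with [0, 0] to a common length
seqs = [g[["Time", "Color"]].to_numpy() for _, g in df.groupby("ID")]
max_len = max(len(s) for s in seqs)
input_array = np.array([
    np.vstack([np.zeros((max_len - len(s), 2), dtype=int), s]) for s in seqs
])
print(input_array.shape)  # (3, 3, 2): 3 subjects, 3 time steps, [time, color]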

Is it possible to modify the output of my LSTM so that, instead of just predicting the next occurrence time and color, I can get the next 3 times, colors, AND the probabilities of each color appearing? For example, an output like

output = [
    [[7, 2, 0.93], [8, 2, 0.79], [10, 4, 0.67]],
    [[9, 2, 0.88], [11, 3, 0.70], [14, 3, 0.43]],
    ...
]

I've looked through the Keras Sequential documentation, but I haven't found anything that addresses this.

Furthermore, I see that a TrainX and TrainY are typically passed to model.fit(), but I'm also not sure what my TrainY would be here.

lvnwrth
  • `Sequential` is unrelated to sequences, it is just an interface to stack layers (a better name would have been `Model`). – runDOSrun Feb 26 '21 at 10:14

1 Answer


LSTM just predicts the next item in each sequence...

Not really; an LSTM is just a layer that helps encode sequential data. It's the downstream layers (the dense and output layers) that determine what the model is going to "predict".
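
To make that concrete, here is a minimal sketch (the layer sizes, shapes and head names are assumptions): the LSTM only encodes the sequence, and the heads stacked on top decide whether the model outputs a class distribution, a regression value, or both.

import tensorflow as tf
from tensorflow.keras import layers

# Assumed sizes: 10 time steps, 2 features (time, color), 5 color classes incl. padding
n_timesteps, n_features, n_colors = 10, 2, 5

inputs = layers.Input(shape=(n_timesteps, n_features))
encoded = layers.LSTM(64)(inputs)                 # the LSTM only encodes the sequence

# The downstream heads decide what is actually predicted:
color_probs = layers.Dense(n_colors, activation="softmax", name="color")(encoded)
next_time = layers.Dense(1, name="time")(encoded)

model = tf.keras.Model(inputs, [color_probs, next_time])
model.compile(optimizer="adam",
              loss={"color": "sparse_categorical_crossentropy", "time": "mse"})
model.summary()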

While you can train an LSTM-based model to predict the next value in a sequence (by cleverly keeping the last time step as your regression target, y), you would ideally want to use an LSTM-based encoder-decoder architecture to properly generate sequences from input sequences.

[Image: LSTM encoder-decoder (seq2seq) architecture]

This is the same architecture that is used for language models to generate text or machine translation models to translate English to French.

You can find a good tutorial on implementing this here. The advantage of this model is that you can choose to decode as many time steps as you need. So for your case, you can feed a padded, fixed-length sequence of colors to the encoder and decode 3 time steps.
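
As a rough illustration (not the full tutorial; the token count, embedding size and latent dimension below are assumptions), the training graph of such an encoder-decoder in Keras could look like this. Note that the softmax output already gives the per-color probabilities asked about in the question:

from tensorflow.keras import layers, Model

num_tokens = 7      # assumed: 4 colors + padding + <start> + <end>
latent_dim = 64

# Encoder: reads the padded input color sequence
encoder_inputs = layers.Input(shape=(None,))
enc_emb = layers.Embedding(num_tokens, 16, mask_zero=True)(encoder_inputs)
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(enc_emb)

# Decoder: generates the target colors (teacher forcing during training)
decoder_inputs = layers.Input(shape=(None,))
dec_emb = layers.Embedding(num_tokens, 16, mask_zero=True)(decoder_inputs)
decoder_out, _, _ = layers.LSTM(latent_dim, return_sequences=True,
                                return_state=True)(dec_emb,
                                                   initial_state=[state_h, state_c])
probs = layers.Dense(num_tokens, activation="softmax")(decoder_out)  # per-color probabilities

model = Model([encoder_inputs, decoder_inputs], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

At inference time the decoder is fed its own previous prediction step by step, so you can roll it out for exactly 3 steps (or any horizon you like); the linked tutorial shows that inference loop in full.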

From a data-prep point of view, you will have to take each sequence of colors, remove the last 3 colors as your y, and pad the rest to a fixed length:

sample = [R, G, B, B, R, G, R, R, B]
X = [<start>, 0, 0, 0, 0, 0, R, G, B, B, R, G, <end>]  #Padded input sequence
y = [<start>, R, R, B, <end>]                          #Y sequence

You will find the necessary preprocessing, training, and inference steps in the link above.
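
For concreteness, a small sketch of what that encoding and pre-padding step might look like with Keras utilities (the integer mapping, the fixed length of 12, and the omission of the <start>/<end> tokens are simplifying assumptions):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Assumed mapping (0 reserved for padding)
color_to_int = {"B": 1, "R": 2, "G": 3, "O": 4}

sample = ["R", "G", "B", "B", "R", "G", "R", "R", "B"]
X_raw = [color_to_int[c] for c in sample[:-3]]   # everything except the last 3 colors
y_raw = [color_to_int[c] for c in sample[-3:]]   # the last 3 colors become the target

X = pad_sequences([X_raw], maxlen=12, padding="pre")   # pre-pad to a fixed length
y = np.array([y_raw])

print(X)  # [[0 0 0 0 0 0 2 3 1 1 2 3]]
print(y)  # [[2 2 1]]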

Akshay Sehgal
  • I see- so if I wanted to obtain the next 3 colors in each sequence, I'd remove the last three observed colors in each sequence to set as my y. When we pad, does it matter if we pad at the beginning or end? My sequences have different lengths, so some would need more padding than others, e.g. I might have 8 items in sequence A and 6 items in sequence B – lvnwrth Feb 25 '21 at 16:08
  • It's better to do pre-padding than post-padding. The reason is that hidden states get flushed out with post-padding. In pre-padding the 0s are followed by the actual sequence, thus retaining the representation of the sequence much better. I have updated my answer to reflect the same (see also the short snippet contrasting the two padding modes after this comment thread). Read [this](https://arxiv.org/pdf/1903.07288.pdf#:~:text=For%20Sentiment%20Analysis%2C%20LSTMs%20are,combined%20to%20perform%20a%20task.) for more detail – Akshay Sehgal Feb 26 '21 at 03:29
  • Also, you can both pad and truncate your sequences: if sequences are shorter than average, pad them, but for very long sequences you may want to truncate. – Akshay Sehgal Feb 26 '21 at 03:30
  • @AkshaySehgal Post-padding is significantly more popular than pre-padding (even the tutorial you've linked used post). While this might be more out of tradition than evidence, the paper you've linked is too limited to claim that one is generally better than the other in terms of model accuracy (if you have any other peer-reviewed good quality studies, I would be interested). In addition, "flushing out" could also apply at the *start* of a sequence but is in practice not an issue since we do masking. – runDOSrun Feb 26 '21 at 10:25
  • To support my claim that post-padding is more popular: TF and Keras have post as the default and [recommend it](https://www.tensorflow.org/guide/keras/masking_and_padding#padding_sequence_data), cuDNN does post-padding, and Huggingface [Transformers](https://huggingface.co/transformers/v3.0.2/preprocessing.html) only support post-padding, so I would really challenge that pre-padding is more popular. PyTorch [doesn't even support](https://github.com/pytorch/pytorch/issues/10536) pre-padding right now. – runDOSrun Feb 26 '21 at 10:46
  • @runDOSrun, I think you are mistaken; please re-read the paper I have linked. That and other articles suggest that pre-padding, not post-padding, is the way to go for LSTMs. I can produce multiple links for the same: [link 1](https://medium.com/@xin.inf19/pre-padding-and-post-padding-in-lstm-975faff4efa5), [link 2](https://arxiv.org/pdf/1903.07288.pdf#:~:text=For%20Sentiment%20Analysis%2C%20LSTMs%20are,combined%20to%20perform%20a%20task.), [link 3](https://stackoverflow.com/questions/46298793/how-does-choosing-between-pre-and-post-zero-padding-of-sequences-impact-results). – Akshay Sehgal Feb 26 '21 at 15:50
  • Regarding your claim on Transformers, masks provide an efficient way to distinguish between padded and non-padded parts of the sequence. – Akshay Sehgal Feb 26 '21 at 15:52
  • And the link you show about TF recommending it does so only because it allows you to use the cuDNN implementation (as it explicitly mentions right after the referenced sentence). – Akshay Sehgal Feb 26 '21 at 15:55
  • Plus, pre-padding is shown (in the paper I link above) to perform better with RNNs and LSTMs but to have no effect on CNN-based architectures (please read the final summary of the paper). – Akshay Sehgal Feb 26 '21 at 15:56
  • @AkshaySehgal Thanks, but your sources provide claims without any evidence (the SO even walks it back in the comments). The cited paper, in particular, doesn't seem to be peer-reviewed and provides a very small-scale experiment with a single model on a single dataset. I'm sorry if this might seem pedantic but scientifically this needs to be replicated in more studies to be considered strong evidence. If we can agree to disagree, I have made my point: it is definitely not obvious which is better as there are different opinions floating around. – runDOSrun Feb 26 '21 at 16:37
  • As you point out, it could be that results vary based on specific scenarios, so I do understand where you are coming from. It would also help, for my understanding, if you have any sources that definitively show why post-padding is better. – Akshay Sehgal Feb 26 '21 at 17:31
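
For readers following the thread above, here is a quick way to see the difference being discussed; both modes are available in Keras via pad_sequences (the example sequences are made up):

from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[2, 3, 1], [1, 2, 3, 4, 2]]

print(pad_sequences(seqs, maxlen=5, padding="pre"))   # zeros before the sequence
# [[0 0 2 3 1]
#  [1 2 3 4 2]]

print(pad_sequences(seqs, maxlen=5, padding="post"))  # zeros after the sequence
# [[2 3 1 0 0]
#  [1 2 3 4 2]]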