
I am working on a sequence-prediction problem where my inputs are of size (numOfSamples, numOfTimeSteps, features): each sample is independent, the number of time steps is uniform across samples (after pre-padding with 0's using keras.preprocessing.sequence.pad_sequences), and my number of features is 2. To summarize my question(s): I am wondering how to structure my Y-label dataset to feed the model, and I want some insight on how to properly structure the model to output what I want.

My first feature is a categorical variable encoded to a unique int and my second is numerical. I want to be able to predict the next categorical variable as well as an associated feature2 value, and then use this to feed back into the network to predict a sequence until the EOS category is output.

This is the main source I've been referencing to try to understand how to create a generator for use with keras.fit_generator. [1]

There is no confusion about how the mini-batch of "X" data is grabbed, but for the "Y" data I am not sure of the proper format for what I am trying to do. Since I am trying to predict a category, I figured a one-hot vector representation of the t+1 timestep would be the proper way to encode the first feature, which I guess results in a 3-dimensional numpy array (samples, timesteps, vocabulary size)? But I'm lost on how to deal with the second, numerical feature.
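To make the shapes concrete, here is a minimal numpy sketch (all sizes and values below are made-up toy numbers, not from the question) of one way to format the Y data: one-hot encode the t+1 category and keep the t+1 numeric value as a second target array, rather than packing both into a single matrix.

```python
import numpy as np

vocab_size = 4          # assumed toy category count; 0 reserved for padding
X = np.zeros((3, 5, 2))                                  # (samples, timesteps, features)
X[:, :, 0] = np.random.randint(1, vocab_size, (3, 5))    # feature 1: category ints
X[:, :, 1] = np.random.rand(3, 5)                        # feature 2: numeric value

# Targets are simply the features shifted one step ahead.
y_cat_int = X[:, 1:, 0].astype(int)   # next-step categories, shape (3, 4)
y_num     = X[:, 1:, 1:]              # next-step numeric values, shape (3, 4, 1)

# One-hot encoding replaces the scalar category with a vocab_size axis, so
# the categorical targets stay 3-D (samples, timesteps, vocab_size), not 4-D.
y_cat = np.eye(vocab_size)[y_cat_int]
```

The numeric target never gets one-hot encoded; it just stays a separate float array that a second output head can regress against.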

Now, this leads me to questions concerning architecture and how to structure a model to do what I am wanting. Does the following architecture make sense? I believe there is something missing that I am not understanding.

Proposed architecture (parameters loosely filled in, nothing set yet):

from keras.models import Sequential
from keras.layers import Masking, LSTM, TimeDistributed, Dense, Activation

model = Sequential()
model.add(Masking(mask_value=0., input_shape=(timesteps, features)))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
model.fit_generator(...)  # I'll figure this out

So, at the end, a softmax activation can predict the next categorical value for feature1. How do I also output a value for feature2 so that I can feed the new prediction for both features back as the next time-step? Do I need some sort of parallel architecture with two LSTMs that are combined somehow?
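One way to get both outputs without two parallel LSTMs is the Keras functional API: a single shared LSTM trunk with two heads. The sketch below is illustrative only (layer sizes, names 'cat'/'num', and the input width vocab_size + 1 for "one-hot category plus one numeric value" are my assumptions, not settled parameters):

```python
from tensorflow import keras
from tensorflow.keras import layers

timesteps, vocab_size, hidden_size = 20, 50, 64   # placeholder sizes

inputs = keras.Input(shape=(timesteps, vocab_size + 1))  # one-hot cat + numeric
x = layers.Masking(mask_value=0.)(inputs)
x = layers.LSTM(hidden_size, return_sequences=True)(x)

# Two heads sharing the same LSTM features: softmax for the category,
# a single linear unit regressing the numeric value.
cat_head = layers.TimeDistributed(
    layers.Dense(vocab_size, activation='softmax'), name='cat')(x)
num_head = layers.TimeDistributed(layers.Dense(1), name='num')(x)

model = keras.Model(inputs, [cat_head, num_head])
model.compile(optimizer='adam',
              loss={'cat': 'categorical_crossentropy', 'num': 'mse'})
```

Keras sums the per-head losses (optionally weighted via loss_weights) into the single scalar that gets backpropagated, so both heads train jointly.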

This is my first attempt at doing anything with neural networks or Keras, and while I would not say I'm "great" at Python, I can get by. However, I feel I have a decent grasp of the fundamental theoretical concepts but lack the practice.

This question is somewhat open-ended, and I encourage you to pick apart my current strategy.

Once again, the overall goal is to predict both features (categorical, numeric) in order to predict "full sequences" from intermediate length sequences.
E.g. I train on these padded max-length sequences, but in production I want to use the model to predict the remaining, currently unseen time-steps, which would be variable in length.
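The production-time loop described above can be sketched as plain Python, with a stand-in `predict` stub in place of a trained model (the stub's behaviour, the EOS id, and the length cap are all illustrative assumptions; a real version would call model.predict on the padded sequence):

```python
import numpy as np

EOS = 3   # assumed id of the end-of-sequence category

def predict(seq):
    # Stub standing in for the trained model: returns (category_probs,
    # numeric_value) for the next timestep. Here it deterministically
    # walks toward EOS so the loop is testable without a model.
    probs = np.zeros(4)
    probs[min(len(seq), EOS)] = 1.0
    return probs, float(len(seq))

seq = [(1, 0.5)]                             # seed: (category, value) pairs
while seq[-1][0] != EOS and len(seq) < 50:   # length cap as a safety net
    probs, value = predict(seq)
    seq.append((int(np.argmax(probs)), value))   # feed prediction back in
```

The key point is that both heads' outputs are appended as the next timestep's input, and generation stops when the categorical head emits EOS (or the cap is hit).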

bad_coder
kylec123

2 Answers


Okay, so if I understand you properly (correct me if I'm wrong), you would like to predict the next features based on the current ones.

When it comes to the categorical variable, you are on point: your Dense layer should output an N-1 vector containing the probability of each class (while we are at it, if you happen to use pandas.get_dummies, remember to specify the argument drop_first=True; a similar approach should be employed with whatever you are using for one-hot encoding).

In addition to that N-1 output vector for each sample, the network should output one more number for the numerical value.

Remember to output logits (no activation; don't use softmax at the end as you currently do). Afterwards, the network output should be separated into the N-1 part (your categorical feature) and passed to a loss function able to handle logits (e.g. in TensorFlow it is tf.nn.softmax_cross_entropy_with_logits_v2, which applies a numerically stable softmax for you).

Now, the N-th element of the network output should be passed to a different loss, probably mean squared error.

Based on the values of those two losses (you could take the mean of both to obtain one loss value), you backpropagate through the network, and it might do just fine.
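The two-loss idea can be illustrated in a few lines of numpy (all numbers below are made-up toy values; in practice the framework computes and combines these for you):

```python
import numpy as np

# Categorical head: cross-entropy between predicted probabilities
# and the one-hot target for one timestep.
probs  = np.array([0.7, 0.2, 0.1])     # toy softmax output
target = np.array([1.0, 0.0, 0.0])     # toy one-hot category target
ce = -np.sum(target * np.log(probs))   # categorical cross-entropy

# Numeric head: squared error on the regressed value.
pred_value, true_value = 2.5, 3.0
mse = (pred_value - true_value) ** 2

# Combine into one scalar to backpropagate (sum, mean, or weighted mix).
total_loss = ce + mse
```

Weighting the two terms matters in practice, since the cross-entropy and squared-error terms can live on very different scales.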

Unfortunately I'm not skilled enough in Keras to help you with the code, but I think you will figure it out yourself. While we're at it, I would suggest PyTorch for more custom neural networks (I think yours fits this description), though it's definitely doable in Keras as well; your choice.

Additional 'maybe helpful' thought: you may want to check out Teacher Forcing for your kind of task. More on the topic and the theory behind it can be found in the outstanding Deep Learning Book, and a code example (though in PyTorch once again) can be found in their docs here.

BTW, interesting idea; mind if I use it in connection with my current research trajectory (with kudos going to you, of course)? Comment on this answer if so and we can talk it out in chat.

Szymon Maszke
  • Thank you for this response. This makes sense, with two losses. In terms of how to structure this via code, I am not so clear. (I realize you aren't a Keras person, so all good; just pointing out my confusion for any other potential responses.) I have only been studying this stuff for a fairly short amount of time, so TensorFlow and Keras are the only two libraries I've really played with. I need to look into PyTorch for sure. I'll take a look at this Teacher Forcing concept now, as I have not heard of it. Thanks! – kylec123 Jan 17 '19 at 19:18
  • Code-wise you probably need to implement new loss function, there are some examples over the internet (e.g. [here](https://stackoverflow.com/questions/45961428/make-a-custom-loss-function-in-keras) ). You might need to go with keras.backend, though it seems to be rather easy and you can probably figure this out. – Szymon Maszke Jan 17 '19 at 19:33
  • And what about the last part starting with "BTW"? Contact me here szymon.maszke@protonmail.com, would like to hear from you in private (apparently StackOverflow doesn't allow for easy interaction between users...). – Szymon Maszke Jan 17 '19 at 19:43

Basically every answer I was looking for was demonstrated and explained in this tutorial. It's an absolutely great resource for understanding how to model multi-output networks, with a lengthy walkthrough of a multi-output CNN architecture. It only took me about three weeks to stumble upon it, however.

https://www.pyimagesearch.com/2018/06/04/keras-multiple-outputs-and-multiple-losses/

kylec123