
This question is rather abstract and not necessarily tied to tensorflow or keras. Say that you want to train a language model, and you want to use inputs of different sizes for your LSTMs. Particularly, I'm following this paper: https://www.researchgate.net/publication/317379370_A_Neural_Language_Model_for_Query_Auto-Completion.

The authors use, among other things, word embeddings and one-hot encoding of characters. Most likely, the dimensions of each of these inputs are different. Now, to feed that into a network, I see a few alternatives but I'm sure I'm missing something and I would like to know how it should be done.

  • Create a 3D tensor of shape (instances, 2, max(embeddings,characters)). That is, padding the smaller input with 0s.
  • Create a 3D tensor of shape (instances, embeddings+characters, 1). That is, concatenating inputs.

It seems to me that both alternatives are bad for efficiently training the model. So, what's the best way to approach this? I see the authors use an embedding layer for this purpose, but technically, what does that mean?
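To make both alternatives concrete, here is a rough numpy sketch of the shapes I have in mind (the dimensions 200 and 38 come from my data, described in the edit below):

```python
import numpy as np

n_instances = 1000                     # made-up number of training examples
emb_dim, char_dim = 200, 38            # word embedding size and character index size

# Alternative 1: stack the two inputs and zero-pad the smaller one
stacked = np.zeros((n_instances, 2, max(emb_dim, char_dim)))
# stacked[:, 0, :emb_dim]  would hold the word embeddings
# stacked[:, 1, :char_dim] would hold the one-hot characters (the rest stays zero)

# Alternative 2: concatenate the two inputs into one long vector
concatenated = np.zeros((n_instances, emb_dim + char_dim, 1))
```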


EDIT

Here are more details. Let's call these inputs X (character-level input) and E (word-level input). For each character of a sequence (a text), I compute x, e and y, the label.

  • x: character one-hot encoding. My character index is of size 38, so this is a vector filled with 37 zeros and one 1.
  • e: precomputed word embedding of dimension 200. If the character is a space, I fetch the word embedding of the previous word in the sequence. Otherwise, I assign the vector for an incomplete word (INC, also of size 200). Real example with the sequence "red car": r>INC, e>INC, d>INC, _>embeddings["red"], c>INC, a>INC, r>INC.
  • y: the label to be predicted, which is the next character, one-hot encoded. This output is of the same dimension as x because it uses the same character index. In the example above, for "r", y is the one-hot encoding of "e".
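For illustration, here is a minimal sketch of how I build x, e and y for each position of a sequence (the names char_index, word_vectors and INC are mine, and word_vectors stands in for my trained Word2Vec embeddings):

```python
import numpy as np

char_index = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789 .")}  # 38 symbols
emb_dim = 200
INC = np.random.rand(emb_dim)                      # vector assigned to incomplete words
word_vectors = {"red": np.random.rand(emb_dim)}    # stand-in for the trained Word2Vec model

def one_hot(char):
    v = np.zeros(len(char_index))
    v[char_index[char]] = 1.0
    return v

sequence = "red car"
for t, char in enumerate(sequence[:-1]):
    x = one_hot(char)                              # character one-hot, shape (38,)
    if char == " ":
        previous_word = sequence[:t].split()[-1]   # the word that was just completed
        e = word_vectors.get(previous_word, INC)   # its embedding, shape (200,)
    else:
        e = INC                                    # incomplete word
    y = one_hot(sequence[t + 1])                   # label: next character, one-hot, shape (38,)
```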
Johy
  • It seems that E is also a character sequence, not a word sequence; can you clarify that? – Daniel Möller Sep 12 '17 at 21:18
  • No, `e` is the word embedding for the character being read during training. More specifically, for each word in the dictionary, it is a vector of 200 floats trained using Word2Vec. But since I do not have a complete word until the space character, this vector is filled randomly with 200 floats. When a space character is read, `e` is then the word embedding of the previous word in the sequence. Is it clearer? – Johy Sep 12 '17 at 21:33
  • Work with words only; this approach will create lots of unnecessary and misleading data in the model. – Daniel Möller Sep 12 '17 at 22:17
  • 1
    One embedding for characters as one input, and one embedding for words as a parallel input. – Daniel Möller Sep 12 '17 at 22:23
  • I see, and both embeddings with the same size I guess? – Johy Sep 12 '17 at 23:10
  • 1
    Not necessarily... char embeddings should probably be way smaller, after all, there are only 26 (or 52) characters plus some few extras. You can have both embeddings and concat them on the last axis: [one example of how to create parallel layers](https://stackoverflow.com/questions/46158427/can-one-create-disconnected-hidden-layers-in-keras/46159214#46159214), in this case, you should define two input tensors, each one goes to a different embedding, and after the embedding you concat them. – Daniel Möller Sep 12 '17 at 23:16
  • 1
    That's exactly what I wanted to do, except I couldn't find the words to make it clear. Thank you soooo very much for helping me out, the link you posted is perfect and your explanations crystal clear. – Johy Sep 12 '17 at 23:27
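For reference, here is a minimal Keras sketch of the parallel-inputs idea from the comments above; the vocabulary sizes, sequence length and layer sizes are placeholders, not values from the paper:

```python
from tensorflow.keras.layers import Input, Embedding, Concatenate, LSTM, Dense
from tensorflow.keras.models import Model

char_vocab, word_vocab = 38, 10000     # placeholder vocabulary sizes
seq_len = 50                           # placeholder sequence length

char_in = Input(shape=(seq_len,), dtype="int32")   # character indices
word_in = Input(shape=(seq_len,), dtype="int32")   # word indices

char_emb = Embedding(char_vocab, 16)(char_in)      # small embedding for characters
word_emb = Embedding(word_vocab, 200)(word_in)     # larger embedding for words

merged = Concatenate(axis=-1)([char_emb, word_emb])        # concat on the last axis
hidden = LSTM(256)(merged)
output = Dense(char_vocab, activation="softmax")(hidden)   # next-character prediction

model = Model([char_in, word_in], output)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```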

1 Answer


According to the Keras documentation, padding seems to be the way to go. The Embedding layer has a masking option (`mask_zero=True`) that makes Keras skip the padded values instead of processing them. In theory you don't lose much performance: if the library is well built, the masking really does skip the extra processing.

You just need to take care not to assign the value zero to any other character, not even spaces or unknown words, since zero is reserved for the padding.
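As a minimal sketch of that in Keras (the sizes are placeholders; the important parts are reserving index 0 for padding and passing `mask_zero=True`):

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 39        # placeholder: 38 real symbols plus index 0, reserved for padding
max_len = 50           # placeholder maximum sequence length

sequences = [[5, 12, 7], [3, 9, 22, 14, 2]]        # integer-encoded sequences of different lengths
padded = pad_sequences(sequences, maxlen=max_len)  # pads with 0 by default

inp = Input(shape=(max_len,), dtype="int32")
emb = Embedding(vocab_size, 32, mask_zero=True)(inp)         # the zeros are masked downstream
out = Dense(vocab_size, activation="softmax")(LSTM(64)(emb))

model = Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```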

An embedding layer is not only for masking (masking is just an option in an embedding layer).

The embedding layer transforms integer values from a word/character dictionary into actual vectors of a certain shape.

Suppose you have this dictionary:

1: hey
2: ,
3: I'm
4: here
5: not

And you form sentences like

[1,2,3,4,0] -> this is "hey, I'm here"
[1,2,3,5,4] -> this is "hey, I'm not here"
[1,2,1,2,1] -> this is "hey, hey, hey"

The embedding layer will transform each of those integers into vectors of a certain size. This does two good things at the same time:

  • Transforms the words into vectors, because neural networks can only handle vectors or intensities. A list of indices cannot be processed by a neural network directly; there is no logical relation between indices and words.

  • Creates a vector that will be a "meaningful" set of features for each word.

And after training, they become "meaningful" vectors. Each element starts to represent a certain feature of the word, although that feature is obscure to humans. It's possible that an embedding becomes capable of detecting which words are verbs, nouns, feminine, masculine, etc., everything encoded in a combination of numeric values (presence/absence/intensity of features).
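To make this concrete, here is a small sketch that runs the example sentences above through an (untrained) Embedding layer; the output dimension of 8 is arbitrary:

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding
from tensorflow.keras.models import Model

sentences = np.array([[1, 2, 3, 4, 0],
                      [1, 2, 3, 5, 4],
                      [1, 2, 1, 2, 1]])

inp = Input(shape=(5,), dtype="int32")
emb = Embedding(input_dim=6, output_dim=8, mask_zero=True)(inp)  # 6 = 5 dictionary entries + index 0
model = Model(inp, emb)

vectors = model.predict(sentences)
print(vectors.shape)   # (3, 5, 8): every integer became an 8-dimensional vector
```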


You may also try the approach in this question, which, instead of using masking, separates the batches by length, so each batch can be trained on its own without any padding: Keras misinterprets training data shape
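A rough sketch of that batching-by-length idea (the function name and data layout here are mine): group the sequences by length and train one group at a time with `train_on_batch`, so no padding is required. The model's input must accept variable lengths, e.g. `Input(shape=(None,))`.

```python
from collections import defaultdict
import numpy as np

def train_by_length(model, sequences, epochs=10):
    """sequences: list of (inputs, targets) pairs whose lengths may differ."""
    buckets = defaultdict(list)
    for x, y in sequences:
        buckets[len(x)].append((x, y))              # group sequences of equal length

    for _ in range(epochs):
        for length, pairs in buckets.items():
            xb = np.array([x for x, _ in pairs])    # all inputs in one bucket share a shape
            yb = np.array([y for _, y in pairs])
            model.train_on_batch(xb, yb)            # one batch per length, no padding needed
```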

Daniel Möller
  • Thanks a lot for helping me clear my mind! :-) The thing is, I am using a trained word embedding already. Therefore, I have this embedding of dimension 200 and this other input (one hot encoding) of dimension 38. How would you feed these 2 inputs into the embedding layer? – Johy Sep 12 '17 at 20:05
  • The embedding doesn't take one-hot encoded inputs; it must take integers. Can you tell me more about the other input? What is it? – Daniel Möller Sep 12 '17 at 20:10