12

I have coded a sequence-to-sequence learning LSTM in Keras myself, using knowledge gained from web tutorials and my own intuition. I converted my sample text to sequences and then padded them using the pad_sequences function in Keras.

from keras.preprocessing.text import Tokenizer, base_filter
from keras.preprocessing.sequence import pad_sequences

def shift(seq, n):
    n = n % len(seq)
    return seq[n:] + seq[:n]

txt="abcdefghijklmn"*100

tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")
tk.fit_on_texts(txt)
x = tk.texts_to_sequences(txt)
# shifting to the left
y = shift(x, 1)

# padding sequences
max_len = 100
max_features = len(tk.word_counts)
X = pad_sequences(x, maxlen=max_len)
Y = pad_sequences(y, maxlen=max_len)

After careful inspection I found that my padded sequences look like this:

>>> X[0:6]
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7]], dtype=int32)
>>> X
array([[ 0,  0,  0, ...,  0,  0,  1],
       [ 0,  0,  0, ...,  0,  0,  3],
       [ 0,  0,  0, ...,  0,  0,  2],
       ..., 
       [ 0,  0,  0, ...,  0,  0, 13],
       [ 0,  0,  0, ...,  0,  0, 12],
       [ 0,  0,  0, ...,  0,  0, 14]], dtype=int32)

Is the padded sequence supposed to look like this? Except for the last column, the array is all zeros. I think I made some mistake in padding the text sequences; if so, can you tell me where the error is?


3 Answers

10

If you want to tokenize by char, you can do it manually; it's not too complex:

First build a vocabulary for your characters:

txt="abcdefghijklmn"*100
vocab_char = {k: (v+1) for k, v in zip(set(txt), range(len(set(txt))))}
vocab_char['<PAD>'] = 0

This will associate a distinct number with every character in your txt. Index 0 should be reserved for padding.
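
For illustration (the exact mapping depends on the iteration order of set(txt), so the numbers below are only an example):

print(vocab_char)
# e.g. {'a': 3, 'b': 7, 'c': 1, ..., 'n': 14, '<PAD>': 0}  -- 14 characters plus the padding token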

Having the reverse vocabulary will be useful to decode the output.

rvocab = {v: k for k, v in vocab_char.items()}
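
For example, decoding a sequence of indices back to text could look like this (seq here stands for any list of integer indices; the name is just for illustration):

decoded = ''.join(rvocab[i] for i in seq if i != 0)  # skip the 0 '<PAD>' entries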

Once you have this, you can split your text into sequences; say you want sequences of length seq_len = 13:

[[vocab_char[char] for char in txt[i:(i+seq_len)]] for i in range(0,len(txt),seq_len)]

Your output will look like:

[[9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4, 3], 
 [14, 9, 12, 6, 10, 8, 7, 2, 1, 5, 13, 11, 4],
 ...,
 [2, 1, 5, 13, 11, 4, 3, 14, 9, 12, 6, 10, 8], 
 [7, 2, 1, 5, 13, 11, 4, 3, 14]]

Note that the last sequence doesn't have the same length; you can discard it, or pad your sequences to max_len = 13, which will add 0's to it.
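
As a sketch of that padding step (assuming the list comprehension above is stored in a variable named seqs; the name is just for illustration):

from keras.preprocessing.sequence import pad_sequences

seqs = [[vocab_char[char] for char in txt[i:(i + seq_len)]]
        for i in range(0, len(txt), seq_len)]
# pads the shorter trailing sequence with 0 ('<PAD>') up to length 13
X = pad_sequences(seqs, maxlen=13, padding='post', value=0)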

You can build your targets Y the same way, by shifting everything by 1. :-)
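
A minimal sketch of that shift, assuming each target window is simply the same text offset by one character:

txt_shifted = txt[1:]  # shift everything left by one character
y_seqs = [[vocab_char[char] for char in txt_shifted[i:(i + seq_len)]]
          for i in range(0, len(txt_shifted), seq_len)]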

I hope this helps.

– Nassim Ben
6

The problem is in this line:

tk = Tokenizer(nb_words=2000, filters=base_filter(), lower=True, split=" ")

When you set the split this way (by " "), then, because of the nature of your data, each sequence consists of a single word. That's why your padded sequences have only one non-zero element. To change that, try:

txt="a b c d e f g h i j k l m n "*100
– Marcin Możejko
  • Thank you for pointing out the error, but what is the best way to solve this? The docs in [keras](https://keras.io/preprocessing/text/#tokenizer) are very vague. – Eka Feb 03 '17 at 01:24
  • What are your sequences separated with? – Marcin Możejko Feb 03 '17 at 07:16
  • My sequence looks something like this: `abcdefghijklmnabcdefghijklmn.....mn`. I want to separate it into individual letters `a b c d e f g h i j k l m n...`, that is, as characters (char sequence-to-sequence learning). – Eka Feb 03 '17 at 08:06
  • Try "" as a split. – Marcin Możejko Feb 03 '17 at 08:48
  • I already did that, but it's giving an error: `ValueError: maketrans arguments must have same length`. I believe the problem is with `pad_sequences`, because with my previous parameters the Tokenizer split the characters and converted them into sequences `>>> x #result [[1], [3], [2], [5],...` – Eka Feb 03 '17 at 12:17
  • I am still confused. I am trying to code a `char-rnn`; that's why I am splitting words into individual characters. For example, take `A Youtube user has uploaded a video showcasing the differences between The Evil Within running with boost mode on PS4 Pro and the base PS4.` and split this text into its individual characters, not words. – Eka Feb 04 '17 at 05:17
  • I don't understand - so you want to split your text into chars or words then? – Marcin Możejko Feb 04 '17 at 10:24
  • I want to write a char-rnn (https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py) with the fewest lines of code, but it seems difficult. I have no idea what else to do now. – Eka Feb 04 '17 at 13:20
0

The padding argument controls whether padding is added before or after each sequence. Use it like this:

X = pad_sequences(x, maxlen=max_len, padding='post')
Y = pad_sequences(y, maxlen=max_len, padding='post')
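
With padding='post' the zeros are appended after the values instead of before them (the default is padding='pre'), for example:

pad_sequences([[1]], maxlen=5)                  # -> array([[0, 0, 0, 0, 1]])
pad_sequences([[1]], maxlen=5, padding='post')  # -> array([[1, 0, 0, 0, 0]])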