
I am a beginner with Keras and with writing neural network models, and I'm currently trying to write an LSTM for text generation, without success. What am I doing wrong?

I read this question: here, and other articles, but there is something I am missing that I can't figure out; sorry if I seem dumb.

The goal

My goal is to generate English articles of a fixed length (1500 tokens, for now).

Suppose I have a dataset of 20k records, consisting of sequences (articles, basically) of different lengths. I set a fixed length for all articles (MAX_SEQUENCE_LENGTH = 1500) and tokenized them, getting a matrix (X, my training data) that looks like:

[[   0    0    0 ...   88  664  206]
 [   0    0    0 ...    1   93  140]
 [   0    0    0 ...    3  173 2283]
 ...
 [  50 2761    4 ...  167  148  156]
 [   0    0    0 ...   10   77  206]
 [   0    0    0 ...  167  148  156]]

with a shape of 20000 x 1500.
The output of my LSTM should be a 1 x MAX_SEQUENCE_LENGTH array of tokens.

My model looks like this:

def generator_model(sequence_input, embedded_sequences, output_shape):
    layer = LSTM(16, return_sequences=True)(embedded_sequences)
    layer = LSTM(32, return_sequences=True)(layer)
    layer = Flatten()(layer)
    output = Dense(output_shape, activation='softmax')(layer)
    generator = Model(sequence_input, output)
    return generator

with:
sequence_input = Input(batch_shape=(1, 1,1500), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
output_shape = MAX_SEQUENCE_LENGTH

The LSTM is supposed to train, with model.fit(), on a training set X of shape 20k x MAX_SEQUENCE_LENGTH.

It should then return an array of tokens of shape 1 x MAX_SEQUENCE_LENGTH when I call model.predict(seed), with seed being a random noise array.

compile, fit and predict

Comments on the following section:
- generator.compile works; the model is given in the edit section of this post.
- generator.fit runs; the epochs=1 param is for testing purposes and will be BATCH_NUM.
- I have some doubts about the y I give to generator.fit. For now I'm passing a matrix of zeros as the target output; if I generate it with a first dimension different from X.shape[0], it throws an error, which means it needs a label for every record in X. But if I give it a matrix of zeros as the target for model.fit, isn't it just going to predict arrays of zeros? (See the shape sketch just after this list.)
- The error it gives is always the same whether I use noise_generator() or noise_integer_generator(); I believe it's because it doesn't like the y_shape param I'm giving.
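For reference, here is a minimal sketch (with stand-in data, not from the post) of the shape contract that fit() enforces with this setup: the model ends in a (None, 1500) softmax, so sparse_categorical_crossentropy expects one integer class label per sample.

import numpy as np

# Stand-in for the tokenized articles (values are hypothetical):
X = np.random.randint(1, 1000, size=(20000, 1500))

# One integer label per record, in [0, 1500); here always class 0.
y = np.zeros((X.shape[0],), dtype=int)

# fit() checks that len(y) == len(X), which is why a target with a
# different first dimension raises an error. And indeed, training on
# an all-zero target teaches the model to always predict class 0.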

embedding_layer = load_embeddings(word_index)
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,))
embedded_sequences = embedding_layer(sequence_input)
generator = generator_model(sequence_input, embedded_sequences, X.shape[1])
print(generator.summary())
generator.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
Xnoise = generate_integer_noise(MAX_SEQUENCE_LENGTH)
y_shape = np.zeros((X.shape[0],), dtype=int)
generator.fit(X, y_shape, epochs=1)
acc = generator.predict(Xnoise, verbose=1)

But actually I'm getting the following error:

ValueError: Error when checking input: expected input_1 to have shape (1500,) but got array with shape (1,)

when I call:

Xnoise = generate_noise(samples_number=MAX_SEQUENCE_LENGTH)
generator.predict(Xnoise, verbose=1)

The noise I give is a 1 x 1500 array, but it seems the model expects a (1500,) shape, so there must be some kind of error in the shape settings for my output.
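A quick illustration of what the error message means (a sketch with hypothetical token values, assuming MAX_SEQUENCE_LENGTH = 1500): Keras treats the first axis as the batch axis, so a flat (1500,) array is read as 1500 samples of shape (1,), which is exactly the mismatch reported above.

import numpy as np

MAX_SEQUENCE_LENGTH = 1500

flat_noise = np.random.randint(1, 1000, size=MAX_SEQUENCE_LENGTH)  # shape (1500,): read as 1500 samples of shape (1,)
batched_noise = flat_noise[np.newaxis, :]                          # shape (1, 1500): one sample of length 1500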

Is my model correct for my purpose, or did I write something really, really stupid that I can't see?

Thanks for any help you can give me; I appreciate it!

edit

Changelog:

v1.
###
- Changed the model structure; now return_sequences=True and using shape instead of batch_shape
###
- Changed 
sequence_input = Input(batch_shape=(1,1,1500), dtype='int32')
to
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,))
###
- The error the model gives has changed

v2.
###
- Changed generate_noise() code
###
- Added generate_integer_noise() code
###
- Added full sequence with the model compile, fit and predict
###
- Added the model.fit summary under the model summary, at the end of the post

generate_noise() code:

def generate_noise(samples_number, mean=0.5, stdev=0.1):
    noise = np.random.normal(mean, stdev, (samples_number, MAX_SEQUENCE_LENGTH))
    print(noise.shape)
    return noise

which prints: (1500,)

generate_integer_noise() code:

def generate_integer_noise(samples_number):
    noise = []
    for _ in range(0, samples_number):
        noise.append(np.random.randint(1, MAX_NB_WORDS))
    Xnoise = np.asarray(noise)  # shape (samples_number,), i.e. a flat 1-D array
    return Xnoise

my function load_embeddings() is as follow:

def load_embeddings(word_index, embeddingsfile='Embeddings/glove.6B.%id.txt' %EMBEDDING_DIM):
    embeddings_index = {}
    f = open(embeddingsfile, 'r', encoding='utf8')
    for line in f:
        values = line.split(' ') #split the line by spaces
        word = values[0] #each line starts with the word
        coefs = np.asarray(values[1:], dtype='float32') #the rest of the line is the vector
        embeddings_index[word] = coefs #put into embedding dictionary
    f.close()

    print('Found %s word vectors.' % len(embeddings_index))

    embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

    embedding_layer = Embedding(len(word_index) + 1,
                                EMBEDDING_DIM,
                                weights=[embedding_matrix],
                                input_length=MAX_SEQUENCE_LENGTH,
                                trainable=False)
    return embedding_layer

model summary:

Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 1500)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1500, 300)         9751200   
_________________________________________________________________
lstm_1 (LSTM)                (None, 1500, 16)          20288     
_________________________________________________________________
lstm_2 (LSTM)                (None, 1500, 32)          6272      
_________________________________________________________________
flatten_1 (Flatten)          (None, 48000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 1500)              72001500  
=================================================================
Total params: 81,779,260
Trainable params: 72,028,060
Non-trainable params: 9,751,200
_________________________________________________________________


model.fit() summary (using a 999-record dataset for testing, instead of the 20k-record one):

999/999 [==============================] - 62s 62ms/step - loss: 0.5491 - categorical_accuracy: 0.9680
  • Now it looks like an error somewhere in `generate_noise` or in `sequence_input = Input(batch_shape=(1, 1, 1500), dtype='int32')`; `batch_shape` should be changed to `(1, 1500)`. Could you provide the code of the `generate_noise` function and `Xnoise.shape`? All dimensions of `Xnoise` except the first should be equal to `batch_shape[1:]`, I guess. – Mikhail Stepanov Jan 21 '19 at 13:37
  • ofc, edited the original post with the requested code! – Basionkler Jan 21 '19 at 13:44
  • OK, there are some unclear points: 1) how do you fit your model? 2) is `return_sequences` or `stateful` set to `True`? I guess the former. Then you don't need to specify `batch_shape`, but could use `shape` instead. Also, what's the desired shape of the target? `(?, 1500)`, am I right? – Mikhail Stepanov Jan 21 '19 at 14:59
  • Also, how do you plan to force `embedding_layer` to process float values (Gaussian noise), not integers? Probably you should sample integers. – Mikhail Stepanov Jan 21 '19 at 15:05
  • I rewrote the answer; also, I kindly advise you to edit the question one more time and place all definitions exactly once, before they are used. The changelog is a good idea! – Mikhail Stepanov Jan 21 '19 at 15:28
  • Thanks for the suggestions! Edited the post and the changelog. Actually I have no strategy to force the NN to process float values, so I added an int-noise generator, as you kindly suggested! Yes, I changed to return_sequences=True! And yes, the output shape should be (?, 1500). Actually I just need an array of 1500, which will be my auto-generated article! Thanks again for the suggestions :) – Basionkler Jan 22 '19 at 10:08
  • Do you still get this error message, or is your actual code working? If the latter, you could separate it from the initial question and post it as an answer and accept it; self-answered questions are OK – Mikhail Stepanov Jan 22 '19 at 10:18
  • Still having the error in the post. I mean, compile and fit work, but predict is not accepting any array I give it (even if the printed shape is (1500,)) – Basionkler Jan 22 '19 at 10:22
  • The printed shape should be `(1, 1500)` or `(some, 1500)`; unfortunately that's not the same as `(1500,)`. By the way, `np.random.normal(1, 1, (1, 1500)).shape` is `(1, 1500)`, so how could you get `(1500,)` from `generate_noise`? OK, I guess the trouble is with `generate_integer_noise` – Mikhail Stepanov Jan 22 '19 at 10:28
  • I've modified the `generate_integer_noise` function; now it returns data of the correct shape. N.B.: my implementation uses the number of samples, not the sequence length, as an argument – Mikhail Stepanov Jan 22 '19 at 10:34

1 Answer


I rewrote the full answer; now it works (at least it compiles and runs; I can't say anything about convergence).

First, I don't know why you use sparse_categorical_crossentropy instead of categorical_crossentropy; it could be important. I changed the model a bit so that it compiles, and used categorical_crossentropy. If you need the sparse one, change the shape of the target.
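To illustrate the difference between the two target formats (a minimal sketch with hypothetical labels, not taken from the post):

from keras.utils import to_categorical
import numpy as np

labels = np.array([2, 0, 1])          # integer class labels: what sparse_categorical_crossentropy expects
one_hot = to_categorical(labels, 3)   # one-hot rows: what categorical_crossentropy expects
print(one_hot.shape)                  # (3, 3)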

Also, I changed the batch_shape argument to shape, because it allows you to use batches of different sizes. It's easier to work with.

And the last edit: you should change generate_noise, because an Embedding layer expects integers in [0, max_features), not normally distributed floats (see the comment in the function).

EDIT
Addressing the last comments, I've removed generate_noise and posted the modified generate_integer_noise function:

from keras.layers import Input, Embedding, LSTM
from keras.models import Model
import numpy as np


def generate_integer_noise(samples_number):
    """
    samples_number is a number of samples, i.e. first dimension in (some, 1500)
    """
    return np.random.randint(1, MAX_NB_WORDS, size=(samples_number, MAX_SEQUENCE_LENGTH))

MAX_SEQUENCE_LENGTH = 1500
"""
Tou can use your definition of embedding layer, 
I post to make a reproducible example
"""
max_features, embed_dim = 10, 300
MAX_NB_WORDS = max_features  # used by generate_integer_noise; noise tokens must be valid embedding indices
embedding_matrix = np.zeros((max_features, embed_dim))
output_shape = MAX_SEQUENCE_LENGTH

embedded_layer = Embedding(
    max_features,
    embed_dim,
    weights=[embedding_matrix],
    trainable=False
)


def generator_model(embedded_layer, output_shape):
    """
    embedded_layer: Embedding keras layer
    output_shape: shape of the target
    """
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH, ))
    embedded_sequences = embedded_layer(sequence_input)   # Set trainable=True if you wish to train the embeddings

    layer = LSTM(32, return_sequences=True)(embedded_sequences)
    layer = LSTM(64, return_sequences=True)(layer)
    output = LSTM(output_shape)(layer)

    generator = Model(sequence_input, output)
    return generator


generator = generator_model(embedded_layer, output_shape)

noise = generate_integer_noise(32)

# generator.predict(noise)
generator.compile(loss='categorical_crossentropy', optimizer='adam')
generator.fit(noise, noise)
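
For a quick sanity check of the output shape, one could run something like the following: the final LSTM(output_shape) layer returns one MAX_SEQUENCE_LENGTH-long row per sample.

preds = generator.predict(generate_integer_noise(2))
print(preds.shape)  # (2, 1500)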
Mikhail Stepanov
  • Many, many thanks for your advice! I edited the post with the changes I made! I already had an Embedding layer, but I forgot to paste the code! I changed the model a bit because it was giving me some input_shape troubles with the stateful param set to True, so I tried to make it simpler than before, but I think I made another mistake :( – Basionkler Jan 21 '19 at 13:21
  • It's working!! I still need to verify whether the predict output makes sense, but I guess that's another question! You are my hero, many, many thanks for the help; now I think I understand better how all these things work! :D – Basionkler Jan 22 '19 at 10:45
  • Glad to hear it; be careful with shapes, they are great time-killers – Mikhail Stepanov Jan 22 '19 at 10:46
  • Yeah, now I know it XD – Basionkler Jan 22 '19 at 10:47