Correctly structuring text data for text generation with Tensorflow model

Question

I am trying to train my model to generate sentences no longer than 210 characters. Everything I have read so far trains on 'continuous' text, like a book. However, I am trying to train my model on single sentences.

I'm pretty new to TensorFlow and ML. Right now I am able to train my model, but it generates garbage, seemingly random text. I have 10,000 sentences, so I think I have sufficient data.

Overview of my data

Structure [['SENTENCE'], ['SENTENCE2']...]

Data Prep

tokenizer = keras.preprocessing.text.Tokenizer(num_words=209, lower=False, char_level=True, filters='#$%&()*+-<=>@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(df['title'].values)
df['encoded_with_keras'] = tokenizer.texts_to_sequences(df['title'].values)

dataset = df['encoded_with_keras'].values
dataset = tf.keras.preprocessing.sequence.pad_sequences(dataset, padding='post')

dataset = dataset.flatten()

dataset = tf.data.Dataset.from_tensor_slices(dataset)

sequences = dataset.batch(seq_len+1, drop_remainder=True)

def create_seq_targets(seq):
    input_txt = seq[:-1]
    target_txt = seq[1:]
    return input_txt, target_txt

dataset = sequences.map(create_seq_targets)

dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True)

Model

def create_model(vocab_size, embed_dim, rnn_neurons, batch_size):
    model = Sequential()
    model.add(Embedding(vocab_size, embed_dim, batch_input_shape=[batch_size, None],input_length=209, mask_zero=True))
    model.add(LSTM(rnn_neurons, return_sequences=True, stateful=True,))
    model.add(Dropout(0.2))
    model.add(Dense(258, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(optimizer='adam', loss="sparse_categorical_crossentropy")
    return model

When I give the model a sequence to start from, I get back absolute nonsense, and eventually the model predicts a 0, which is not in the char_index mapping.

Edit

Text Generation


epochs = 2

# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)


model = create_model(vocab_size = vocab_size,
  embed_dim=embed_dim,
  rnn_neurons=rnn_neurons,
  batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

def generate_text(model, start_string):
  num_generate = 200

  input_eval = [char_2_index[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  text_generated = []

  temperature = 1

  # model.reset_states()
  for i in range(num_generate):
      print(text_generated)
      predictions = model(input_eval)

      predictions = tf.squeeze(predictions, 0)

      predictions = predictions / temperature

      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
      print(predicted_id)

      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(index_2_char[predicted_id])

  return (start_string + ''.join(text_generated))

Would you care to show how you're training and how you're predicting? — Daniel Möller, Feb 14 '20 at 18:17

Daniel Möller · Answer 1 · 2020-02-14T18:42:11.930

A few things need to change at first sight.

The Tokenizer must have num_words = vocab_size.
At first glance (I didn't analyse it deeply), I can't see why you're flattening your dataset and slicing it again, since after pad_sequences it is probably already correctly structured.
You cannot use stateful=True unless you want "batch 2 is a sequel of batch 1". You have individual sentences, so use stateful=False. (The exception is training with manual training loops and resetting states for each batch, which is unnecessary trouble in the training phase.)
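
On the first point, a minimal sketch of why the vocabulary size is normally one more than the number of distinct characters: the Keras Tokenizer numbers tokens from 1, which leaves 0 free for padding. The pure-Python code below only mimics that numbering for illustration; `build_char_index` and the sample sentences are hypothetical and not part of the original code.

```python
from collections import Counter


def build_char_index(sentences):
    """Map each character to an integer id starting at 1 (0 is kept for padding)."""
    counts = Counter(ch for sentence in sentences for ch in sentence)
    # Most frequent characters get the smallest ids.
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}


sentences = ["abca", "cab"]  # hypothetical sample data
char_2_index = build_char_index(sentences)
index_2_char = {i: ch for ch, i in char_2_index.items()}

# +1 because id 0 is reserved for padding and never assigned to a character.
vocab_size = len(char_2_index) + 1
print(char_2_index, vocab_size)
```

With this numbering, the Embedding and the final Dense layer both need `vocab_size` entries so that the padding id 0 has a slot of its own.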

What you need to check visually:

Input data must have a format like:

[
    [1,2,3,6,10,4,10, ...up to sentence length - 1...],
    [5,6,3,6,7,3,11,... up to sentence length - 1...],
    .... up to number of sentences ...
]

Output data must then be:

[
    [2,3,6,10,4,10,15 ...], #equal to input data, shifted by 1
    [6,3,6,7,3,11,13, ...],
    ...
]

Print a few rows of them to check if they're correctly preprocessed as intended.
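
As a rough illustration of the two arrays above, the sketch below pads a few encoded sentences and builds the shifted input/target pairs, keeping one row per sentence instead of flattening. It is plain Python so it can be run on its own; `pad_post`, `make_inputs_and_targets` and the sample ids are hypothetical names and values. With the NumPy array returned by pad_sequences, the same split would presumably be `padded[:, :-1]` and `padded[:, 1:]`.

```python
def pad_post(sequences, pad_id=0):
    """Right-pad every sequence with pad_id to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def make_inputs_and_targets(padded):
    """Input is each row without its last id; target is the same row shifted by one."""
    input_data = [row[:-1] for row in padded]
    output_data = [row[1:] for row in padded]
    return input_data, output_data


encoded = [[1, 2, 3, 6], [5, 6, 3], [4, 2]]  # hypothetical encoded sentences
padded = pad_post(encoded)
input_data, output_data = make_inputs_and_targets(padded)

# Print each pair to check the shift visually.
for x, y in zip(input_data, output_data):
    print(x, "->", y)
```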

Training will then be easy:

model.fit(input_data, output_data, epochs=....)

Yes, your model will predict zeros, since you have zeros in your data. That's not weird: you did a pad_sequences.
You can interpret a zero as a "sentence end" in this case, since you used 'post' padding. When your model gives you a zero, it has decided that the sentence it's generating should end at that point. If it was well trained, it will probably continue outputting zeros for that sentence from this point on.
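
One way to act on this when decoding, sketched under the assumption that id 0 is only ever padding: stop at the first zero instead of looking it up in the index-to-character mapping, which avoids the missing-key problem described in the question. `decode_until_padding` and the tiny mapping are hypothetical.

```python
def decode_until_padding(ids, index_2_char, pad_id=0):
    """Turn predicted ids into text, treating the first pad_id as the sentence end."""
    chars = []
    for i in ids:
        if i == pad_id:
            break  # the model signalled the end of the sentence
        chars.append(index_2_char[i])
    return "".join(chars)


index_2_char = {1: "h", 2: "i", 3: "!"}  # hypothetical mapping
print(decode_until_padding([1, 2, 3, 0, 0], index_2_char))
```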

Generating new sentences

This part is more complex: you need to rebuild the model, this time with stateful=True, and transfer the weights from the trained model to this new model.

Before anything, call model.reset_states().

You will need to manually feed a batch with shape (number_of_sentences=batch_size, 1). This will be the "first character" of each of the sentences it will generate. The output will be the "second character" of each sentence.

Get this output and feed the model with it. It will generate the "third character" of each sentence. And so on.

When all outputs are zero, all sentences are fully generated and you can stop the loop.

Call model.reset_states() again before trying to generate a new batch of sentences.
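
The loop described above can be sketched without any framework code. In the sketch below, `step_fn` stands for "feed the latest character of every sentence to the stateful model and pick the next id for each", and `reset_fn` stands for model.reset_states(); both are passed in as plain callables. `generate_batch` and `ToyModel` are hypothetical names, and `ToyModel` is only a stand-in that makes the example runnable, not a real network.

```python
def generate_batch(step_fn, reset_fn, first_ids, max_len=210, pad_id=0):
    """Generate one sentence per entry of first_ids, one character id at a time."""
    reset_fn()  # start from clean states
    sentences = [[i] for i in first_ids]
    current = list(first_ids)
    for _ in range(max_len - 1):
        current = step_fn(current)  # next id for every sentence in the batch
        for sentence, next_id in zip(sentences, current):
            sentence.append(next_id)
        if all(next_id == pad_id for next_id in current):
            break  # every sentence has reached its end marker
    reset_fn()  # leave clean states for the next batch of sentences
    return sentences


class ToyModel:
    """Hypothetical stand-in for a stateful model: emits 0 once `limit` steps have run."""

    def __init__(self, limit):
        self.limit = limit
        self.steps = 0

    def reset(self):
        self.steps = 0

    def step(self, ids):
        self.steps += 1
        return [0 if self.steps >= self.limit else i + 1 for i in ids]


toy = ToyModel(limit=3)
generated = generate_batch(toy.step, toy.reset, first_ids=[1, 5])
print(generated)
```

The `max_len` cap is a safety net for a model that never emits a zero; 210 is taken from the sentence limit in the question.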

You can find examples of this kind of predicting here: https://stackoverflow.com/a/50235563/2097240

Still getting seemingly random output. — GrepThis, Feb 16 '20 at 16:19
