I just prepared text data using the Keras Tokenizer:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
VOCAB_SIZE = 10000
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

As I understand it, all sequences need to have the same length to be fed into a neural network. How should I use the pad_sequences function from Keras to do this? Would it be the following (I am not sure about maxlen)?

X_train_seq_padded = pad_sequences(X_train_seq, maxlen=VOCAB_SIZE)
X_test_seq_padded = pad_sequences(X_test_seq, maxlen=VOCAB_SIZE)
SomeDutchGuy
  • Same sequence length is not compulsory but good practice for batch optimization. Check [this](https://stackoverflow.com/questions/66813950/movie-review-classification-with-recurrent-networks) QnA for details. Hope that helps. – Innat May 04 '21 at 13:42
  • @M.Innat I gather from that link that I am using it wrong and should set it to the length of the longest entry in the training data, or less. – SomeDutchGuy May 04 '21 at 13:50
  • It's not necessary to use the largest sequence length; a reasonable size is enough. In practice, it is a kind of hyperparameter. – Innat May 04 '21 at 13:53

1 Answer

Yes, pad_sequences is the right tool here. Technically your code will work and the model will run: with maxlen set to VOCAB_SIZE, every sequence is padded to 10,000 tokens.

However, this may not be the best way to achieve what you're trying to do:

  • As a rule of thumb with text data, the average sequence length is much smaller than the vocabulary size. The two are unrelated quantities: VOCAB_SIZE counts distinct tokens, while maxlen counts tokens per sequence
  • In your case, for example, try computing the average length of your sequences, or even the maximum length; it is very unlikely that either number will be anywhere close to 10,000
  • If that holds for your data, the model is seeing inputs that consist almost entirely of padding zeros, and they can easily be made much denser by choosing a better padding length
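To check this on your own data, something like the following may help. It is only a sketch: the toy `sequences` list stands in for `X_train_seq`, and the helper names `length_stats` and `padding_fraction` are made up for illustration.

```python
from statistics import mean

def length_stats(sequences):
    """Return (mean, max) token counts for a list of integer sequences."""
    lengths = [len(seq) for seq in sequences]
    return mean(lengths), max(lengths)

def padding_fraction(sequences, maxlen):
    """Fraction of positions that are padding once every sequence
    is padded or truncated to exactly `maxlen` tokens."""
    real_tokens = sum(min(len(seq), maxlen) for seq in sequences)
    return 1 - real_tokens / (len(sequences) * maxlen)

# Toy stand-in for X_train_seq (hypothetical data).
sequences = [[4, 12, 7], [9, 3], [5, 8, 2, 6, 1]]

avg_len, max_len = length_stats(sequences)
print("mean length:", avg_len, "max length:", max_len)
print("padding at maxlen=10000:", padding_fraction(sequences, 10000))
print("padding at maxlen=max_len:", padding_fraction(sequences, max_len))
```

If the padding fraction at maxlen=10000 comes out close to 1, almost everything the model sees is padding.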

So you can leave your code intact and just replace the value of maxlen in the pad_sequences(...) call with a more reasonable number.

  • This may be the maximum length of your input sequences, or any other suitable metric
    • One approach that might be useful when you're starting out is to set it to the mean sequence length plus one standard deviation, but naturally this is very task-specific
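The mean-plus-one-standard-deviation heuristic could be sketched roughly as below. The helper names `suggest_maxlen` and `pad_with_suggested_maxlen` are hypothetical, and rounding up and capping at the longest sequence are my own assumptions, not something Keras prescribes.

```python
import math
from statistics import mean, pstdev

def suggest_maxlen(sequences):
    """Mean sequence length plus one (population) standard deviation,
    rounded up and capped at the longest sequence."""
    lengths = [len(seq) for seq in sequences]
    candidate = math.ceil(mean(lengths) + pstdev(lengths))
    return min(max(lengths), candidate)

def pad_with_suggested_maxlen(train_seq, test_seq):
    """Pad both splits to a maxlen derived from the training data only."""
    # Imported lazily so suggest_maxlen works without TensorFlow installed.
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    maxlen = suggest_maxlen(train_seq)
    return (pad_sequences(train_seq, maxlen=maxlen),
            pad_sequences(test_seq, maxlen=maxlen))

# Toy stand-in for X_train_seq (hypothetical data): lengths 2, 2, 2, 2, 10.
toy = [[1, 2], [3, 4], [5, 6], [7, 8], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]
print(suggest_maxlen(toy))
```

With your data this would be called as `pad_with_suggested_maxlen(X_train_seq, X_test_seq)`. Keep in mind that sequences longer than maxlen are truncated, and that by default pad_sequences both pads and truncates at the start of the sequence (`padding='pre'`, `truncating='pre'`).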
rishabhjha