Sorry if this is a noob question, but I haven't found a similar thread. I'm trying to learn how to create word embeddings from a large dataset of tweets for sentiment classification. I'm using the Keras TextVectorization layer to convert the tweets into integer sequences. I noticed that any word not in the specified vocabulary always maps to the integer 1. Doesn't that mean the model will also learn weights for words that are not in the vocabulary? If so, how do you avoid that?
Here's a snippet:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers as tfl

vectorizer = tfl.TextVectorization(
    # max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=50,
    standardize=std,      # custom standardization function
    vocabulary=vocab)     # precomputed vocabulary list
test = np.array(['dogs are very cute wordnotinvocabulary'])
vectorizer(test)
Output: <tf.Tensor: shape=(1, 50), dtype=int64, numpy= array([[425842, 52874, 305572, 514379, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)>
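To make this easy to reproduce, here's a minimal self-contained version of what I mean (the toy vocabulary, sequence length, and embedding size are made up for illustration). It shows that index 1 ([UNK]) gets its own trainable row in the Embedding table just like in-vocabulary words, while mask_zero only masks the padding index 0:

import numpy as np
import tensorflow as tf

vocab = ["dogs", "are", "very", "cute"]  # toy vocabulary
vectorizer = tf.keras.layers.TextVectorization(
    output_mode="int",
    output_sequence_length=8,
    vocabulary=vocab)

ids = vectorizer(np.array(["dogs are very cute wordnotinvocabulary"]))
print(ids)  # [[2 3 4 5 1 0 0 0]] -- the unknown word maps to 1 ([UNK])

# vocabulary_size() includes the reserved padding (0) and OOV (1) indices,
# so the Embedding table also has a trainable row for index 1; its weights
# get updated by gradient descent whenever an OOV word appears in a batch.
embedding = tf.keras.layers.Embedding(
    input_dim=vectorizer.vocabulary_size(),
    output_dim=16,
    mask_zero=True)  # masks padding (0) only, not the OOV index (1)

vectors = embedding(ids)
print(vectors.shape)  # (1, 8, 16)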