
Sorry if this is a noob question, but I haven't found a similar thread... I'm trying to learn how to create word embeddings from a large dataset of tweets for sentiment classification. I'm using the Keras TextVectorization layer to convert the tweets into integer sequences. I noticed that if a word is not in the specified vocabulary, it always maps to integer 1. Wouldn't that mean the model will also learn weights for words that are not in the vocabulary? If so, how do you avoid that?

Here's a snippet:

import numpy as np
from tensorflow.keras import layers as tfl

vectorizer = tfl.TextVectorization(
    # max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=50,
    standardize=std,     # custom standardization function, defined elsewhere
    vocabulary=vocab)    # precomputed vocabulary list, defined elsewhere

test = np.array(['dogs are very cute wordnotinvocabulary'])
vectorizer(test)

Output: <tf.Tensor: shape=(1, 50), dtype=int64, numpy= array([[425842, 52874, 305572, 514379, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)>

1 Answer


The Keras TextVectorization layer reserves a token (index 1) for out-of-vocabulary (OOV) words. This means the model will indeed learn weights for words that aren't in the vocabulary, but it learns only a single embedding vector that is shared by every word outside the vocabulary. I'm not sure why you would want to avoid this: it costs almost no extra space, since there's only one extra embedding to learn, and it still conveys to the model that some word is present at that position.
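Here's a minimal sketch of that behavior (the toy vocabulary, sequence length, and embedding size are my own choices, not from your setup):

import numpy as np
from tensorflow.keras import layers

vocab = ["dogs", "are", "very", "cute"]   # toy vocabulary
vectorizer = layers.TextVectorization(
    output_mode="int",
    output_sequence_length=8,
    vocabulary=vocab)

tokens = vectorizer(np.array(["dogs are very cute wordnotinvocabulary"]))
print(tokens)  # known words -> 2..5, the unknown word -> 1, padding -> 0

# vocabulary_size() already counts the padding token (0) and the OOV token (1),
# so the embedding matrix has exactly one extra row (row 1), and that row is
# shared by every out-of-vocabulary word.
embedding = layers.Embedding(
    input_dim=vectorizer.vocabulary_size(),
    output_dim=16,
    mask_zero=True)
print(embedding(tokens).shape)  # (1, 8, 16)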

If you wanted to remove it anyway, you could replace all 1s in this layer's output with 0s, using something like the answer here.
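For instance, a rough sketch with tf.where, reusing the vectorizer from the snippet above (the drop_oov helper name is mine, not part of Keras):

import tensorflow as tf

def drop_oov(token_ids):
    # Replace the OOV index (1) with the padding index (0); with
    # Embedding(..., mask_zero=True) those positions are then masked out.
    return tf.where(token_ids == 1, tf.zeros_like(token_ids), token_ids)

tokens = vectorizer(np.array(["dogs are very cute wordnotinvocabulary"]))
print(drop_oov(tokens))  # the 1 produced by the unknown word becomes 0

Keep in mind that with mask_zero=True the model will then treat OOV positions exactly like padding and ignore them entirely.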

rchome