
I have a vocabulary of restaurant-related words in Spanish, and I am using pretrained Spanish word embeddings from FastText and BERT. However, I see that there are a lot of out-of-vocabulary (OOV) words that the pretrained embeddings do not recognize. My vocabulary is also very limited, so it does not make sense to train word embeddings from scratch just for that. Is there any approach I could follow to expand the pretrained word embeddings so that they cover most of the OOV words?

Thank you

2 Answers


FastText can produce vectors even for out-of-vocabulary (OOV) words by summing the vectors of their component character n-grams, provided at least one of those n-grams was present in the training data.
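For example, a minimal sketch with gensim, assuming you have downloaded one of the official Spanish FastText binaries (the file name and the word "croquetitas" are just illustrative assumptions):

```python
from gensim.models.fasttext import load_facebook_vectors

# Assumes the official Spanish FastText binary from fasttext.cc is available locally.
wv = load_facebook_vectors("cc.es.300.bin")

# "croquetitas" is a hypothetical OOV word for illustration.
print("croquetitas" in wv.key_to_index)        # likely False: not stored as a full word
vec = wv["croquetitas"]                        # still returns a vector built from char n-grams
print(vec.shape)                               # (300,)
print(wv.most_similar("croquetitas", topn=5))  # nearest neighbours in the pretrained space
```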

The same applies to BERT, which uses subword tokenization (WordPiece, a BPE-like algorithm): unknown words are split into known subword pieces rather than being dropped.
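As a rough sketch with Hugging Face transformers (the checkpoint name, BETO, a commonly used Spanish BERT model, and the example word are assumptions), a rare word is split into known pieces, and one simple way to get a single word vector is to mean-pool those pieces:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Model name is an assumption: BETO, a widely used Spanish BERT checkpoint.
name = "dccuchile/bert-base-spanish-wwm-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# A rare word is split into known subword pieces, so it is never truly OOV.
print(tokenizer.tokenize("croquetitas"))        # e.g. ['cro', '##quet', '##itas']

# Mean-pool the subword embeddings to get one vector for the word.
inputs = tokenizer("croquetitas", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)
word_vec = hidden[0, 1:-1].mean(dim=0)          # drop [CLS]/[SEP], average the rest
print(word_vec.shape)                           # torch.Size([768])
```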

I'm not sure about the details of your implementation, but if your embedding model was trained on a large enough Spanish corpus, it should contain the necessary character n-grams and be able to return embeddings for more specific words as well.

You can also try another, perhaps larger, embedding model.

Simi

I would suggest obtaining enough training text that's closely related to your domain to train your own model. For example, if you can acquire/scrape enough Spanish-language restaurant reviews, cookbooks, foodie discussions, etc., so that you have many mentions of all the words important to your final use, you'd likely obtain a custom domain-centric model with very good vectors for your important terms.
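A minimal gensim sketch of such a training run, assuming a hypothetical file reviews.txt containing your scraped domain text, one document per line (all names and hyperparameters here are illustrative, not prescriptive):

```python
from gensim.models import FastText
from gensim.utils import simple_preprocess

# "reviews.txt" is a hypothetical corpus of Spanish restaurant text, one document per line.
with open("reviews.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

model = FastText(
    sentences,
    vector_size=300,    # match the dimensionality you plan to use downstream
    window=5,
    min_count=3,        # drop extremely rare tokens
    min_n=3, max_n=6,   # character n-gram range used for subword vectors
    epochs=10,
)
model.save("restaurant_fasttext.model")
print(model.wv.most_similar("camarero", topn=5))  # "camarero" is a hypothetical domain word
```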

I know you think your vocabulary of interest is too small, but if there are truly "a lot" of words missing from the more general pretrained models, the best way to model those words well is to do real training, with sufficient data, so a model can learn them. And to place them in proper relation to other, more general words, the training should include your words of interest in realistic contexts alongside those other words.

Using FastText for such custom-corpus training can help a bit with long-tail words that are similar to (share character subranges with) known words: the resulting model's synthesized vectors for unknown words can be far better than nothing when those out-of-vocabulary ("OOV") words are small variations on related known words, such as alternate spellings, word forms, or typos (see the short example below). But you still want lots of varied domain-specific usages to really model the words well.
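Continuing the training sketch above, the custom model can synthesize usable vectors for variants it never saw (the words here are hypothetical examples):

```python
# Reusing the custom FastText model trained above.
print(model.wv.similarity("camarero", "camareros"))   # inflected form of a seen word
print(model.wv.most_similar("kamarero", topn=3))      # typo, still handled via shared char n-grams
```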

If you in fact need word vectors that are coordinate-compatible with some other frozen-in-place vectors from elsewhere, that's a trickier problem. But there are ways to "project" (coordinate-translate) words from one model into another's coordinate space, provided each model is reasonably good on its own and there is a large number of common words to serve as anchors/guideposts for learning the transformation.

The general strategy is briefly described in section 2.2 ("Vocabulary Expansion") of the 2015 paper "Skip-Thought Vectors". The Python library Gensim has a TranslationMatrix class and an example notebook that may help with such coordinate projection.

But even with this technique, you'd first train up your own separate word-vector model, which is sure to have plentiful usage examples of your words of interest. Then you'd coordinate-translate those new words into the larger pretrained word-vector model, using words that appear in both models as a 'map' for where the new words should be projected over.
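A rough numpy sketch of that projection idea (this is the same least-squares linear mapping that Gensim's TranslationMatrix learns; small_wv and big_wv stand for the two models' KeyedVectors and are assumptions of this sketch):

```python
import numpy as np

def learn_projection(small_wv, big_wv):
    """Learn a linear map from the custom model's space into the big pretrained space."""
    # Anchor words: words present in both models.
    shared = [w for w in small_wv.key_to_index if w in big_wv.key_to_index]
    X = np.stack([small_wv[w] for w in shared])   # source (custom-model) coordinates
    Y = np.stack([big_wv[w] for w in shared])     # target (pretrained) coordinates
    # Least-squares solution for W such that X @ W ≈ Y.
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W

def project(word, small_wv, W):
    """Translate a domain-only word into the pretrained model's coordinate space."""
    return small_wv[word] @ W
```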

gojomo