word2vec - KeyError: "word X not in vocabulary"

Question

I am using gensim's Word2Vec implementation to build word embeddings for the sentences in a plain text file. Even though the word happy is present in my data, I get the error KeyError: "word 'happy' not in vocabulary". I tried the answers given to a similar question, but they did not work, so I am posting my own question.

Here is the code:

try:
    data = []
    with open(TXT_PATH, 'r', encoding='utf-8') as txt_file:
        for line in txt_file:
            for part in line.split(' '):
                data.append(part.strip())

    # When I debug, both of the words 'happy' and 'birthday' exist in the variable 'data'
    word2vec = Word2Vec(data, min_count=5, size=10000, window=5, workers=4)

    # Print result
    word_1 = 'happy'
    word_2 = 'birthday'
    print(f'Similarity between {word_1} and {word_2} thru word2vec: {word2vec.similarity(word_1, word_2)}')
except Exception as err:
    print(f'An error happened! Detail: {str(err)}')

score 2 · Accepted Answer · answered Nov 01 '19 at 23:27

When you get a "not in vocabulary" error like this from Word2Vec, you can trust it: 'happy' really isn't in the model.

Even if your visual check shows 'happy' inside your file, a few reasons why it might not wind up inside the model include:

it doesn't occur at least min_count=5 times
the data format isn't correct for Word2Vec, so it's not seeing the words you expect it to see.
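
The first cause can be checked without training anything. Here is a minimal sketch, assuming whitespace tokenization. The helper `token_counts` and the sample lines are hypothetical, made up purely for illustration:

```python
from collections import Counter

def token_counts(lines):
    """Count whitespace-separated tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Hypothetical sample standing in for the contents of the text file.
sample_lines = [
    "happy birthday to you",
    "happy new year",
    "a very happy day",
]

min_count = 5  # same threshold as in the question
counts = token_counts(sample_lines)

# A word is only kept if it occurs at least min_count times.
survives = counts["happy"] >= min_count
print(counts["happy"], survives)
```

If the count for a word you care about falls below the threshold, either lower min_count or supply more text.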

Looking at how data is prepared by your code, it looks like a giant list of all words in your file. Word2Vec instead expects a sequence that has, as each item, a list-of-words for that one text. So: not a list-of-words, but a list where each item is a list-of-words.

If you've supplied...

[
  'happy',
  'birthday',
]

...instead of the expected...

[
  ['happy', 'birthday',],
]

...those single-word-strings will be seen as lists-of-characters, so Word2Vec will think you want to learn word-vectors for a bunch of one-character words. You can check if this has affected your model by seeing if the vocabulary size seems small (`len(model.wv)`) or if a sample of learned-words is only single-character words (`model.wv.index2entity[:10]`).
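
To make the difference between the two shapes concrete, here is a small sketch in plain Python. It only mimics the idea that each item of the corpus is iterated to obtain its tokens; it does not call gensim. The names `tokens_seen` and `load_sentences` are hypothetical, and splitting on whitespace is an assumption about the file:

```python
def tokens_seen(corpus):
    """Collect the tokens obtained by iterating over every item of the corpus."""
    seen = set()
    for text in corpus:
        for token in text:  # iterating over a str yields its characters
            seen.add(token)
    return seen

def load_sentences(lines):
    """Turn each non-empty line into one list of whitespace-separated tokens."""
    return [line.split() for line in lines if line.strip()]

flat = ['happy', 'birthday']                    # list of words: the wrong shape
nested = load_sentences(['happy birthday\n'])   # list of lists of words

print('happy' in tokens_seen(flat))    # only single characters are seen
print('happy' in tokens_seen(nested))  # whole words are seen
```

With a real file, the lines would come from the open file object, and the resulting list of token lists is what would be passed to Word2Vec in place of `data`.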

If you supply a word in the right format, at least min_count times, as part of the training-data, it will wind up with a vector in the model.

(Separately: size=10000 is a choice way outside the usual range of 100-400. I've never seen a project using such high-dimensionality for word-vectors, and it would only be theoretically justifiable if you had a massively-large vocabulary and training-set. Oversized vectors with smaller vocabularies/data are likely to create uselessly overfit results.)

Thank you very much for your contribution. The modifications I have done through your feedback worked like a charm. When I decreased the `size` to 100, and 400, got the `similarity score` of 0.795, and 0.7177, respectively which was 0.928 before. — talha06, Nov 01 '19 at 23:38
