Gensim- KeyError: 'word not in vocabulary'

Question

I am trying to compute product similarity in the same way as this example: how-to-build-recommendation-system-word2vec-python/

I have a dictionary where the key is the item_id and the value is the product associated with it. For example: dict_items([('100018', ['GRAVY MIX PEPPER']), ('100025', ['SNACK CHEEZIT WHOLEGRAIN']), ('100040', ['CAULIFLOWER CELLO 6 CT.']), ('100042', ['STRIP FRUIT FLY ELIMINATOR'])....)

The data structure is the same as in the example (as far as I know). However, I am getting KeyError: "word '100018' not in vocabulary" when calling the similarity function on the model using the key present in the dictionary.

# train word2vec model
from gensim.models import Word2Vec

model = Word2Vec(window=10, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)

def similar_products(v, n=6):  # similarity function
    # extract most similar products for the input vector
    # ([1:] drops the top hit, which is the queried item itself
    # when v is that item's own vector)
    ms = model.similar_by_vector(v, topn=n + 1)[1:]

    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)

    return new_ms

I am calling the function using:

similar_products(model['100018'])

Note: I was able to run the example code with a very similar input data structure, which was also a dictionary. Can someone tell me what I am missing here?

Always put the full error message (starting at the word "Traceback") in the question (not in comments) as text (not as a screenshot or a link to an external portal). It contains other useful information. — furas, Jun 06 '22 at 16:56
You call it with `100018`, but the error shows a problem with `138021`. Maybe you are running different code. Maybe first check whether you really have `138021` in the dictionary. — furas, Jun 06 '22 at 16:58
Maybe first use `print()` (and `print(type(...))`, `print(len(...))`, etc.) to see which part of the code is executed and what you really have in the variables. This is called "print debugging", and it helps to see what the code is really doing. — furas, Jun 06 '22 at 16:59
@furas Thank you for pointing out the discrepancy in the key in the above example, but it wouldn't work with any of the keys. I printed the word's occurrence count in the training set and it was less than 5. As gojomo points out below, words occurring fewer than 5 times don't get a good vector representation. Thank you for your suggestions. I will incorporate them in my future questions. — Neo, Jun 07 '22 at 18:28

score 1 · Accepted Answer · answered Jun 07 '22 at 09:15

If you get a KeyError telling you a word isn't in your model, then the word genuinely isn't in the model.

If you've trained the model yourself, and expected the word to be in the resulting model, but it isn't, something went wrong with training.

You should look at the corpus (purchases_train in your code) to make sure each item is of the form the model expects: a list of words. You should enable logging during training, and watch the output to confirm the expected amount of word-discovery and training is happening. You can also look at the exact list-of-words known-to-the-model (in model.wv.key_to_index) to make sure it has all the words you expect.
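The corpus-shape check and the logging setup described above can be sketched with the standard library alone. The helper name `check_corpus` and the toy data are hypothetical, and the logging format string is just one common choice; in your case the corpus to check would be `purchases_train`.

```python
import logging

# Show INFO-level progress messages, which gensim emits through the
# standard logging module during build_vocab() and train().
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)


def check_corpus(corpus):
    """Return a list of problems found in a corpus meant for Word2Vec.

    Hypothetical helper: it only checks that every item is a list of
    string tokens, the shape described in the paragraph above.
    """
    problems = []
    for i, sentence in enumerate(corpus):
        if isinstance(sentence, str):
            # A bare string would be iterated character by character.
            problems.append(f"item {i} is a string, not a list of tokens")
        elif not all(isinstance(token, str) for token in sentence):
            problems.append(f"item {i} contains non-string tokens")
    return problems


# Toy data, made up for illustration only.
good = [["100018", "100025"], ["100040", "100018"]]
bad = ["100018 100025", ["100040", 100018]]

print(check_corpus(good))
print(check_corpus(bad))
```

An empty list means every item has the expected shape. A string item is the case to watch for, since each character would then be treated as a separate "word".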

One common gotcha is that, for the best operation of the word2vec algorithm, the Word2Vec class uses a default min_count=5. (Word2vec only works well with multiple varied examples of a word's usage; a word appearing just once, or just a few times, usually won't get a good vector and, further, might make surrounding words' vectors worse. So the usual best practice is to discard very rare words.)

Is the (pseudo-)word '100018' in your corpus less than 5 times? If so, the model will ignore it as a word too-rare to get a good vector, or have any positive influence on other word-vectors.
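One way to answer that question is to count token occurrences in the corpus directly. This is a minimal sketch using only the standard library; `rare_tokens` is a hypothetical helper and the toy corpus is made up, so substitute `purchases_train` for it. The threshold mirrors the `min_count=5` default mentioned above.

```python
from collections import Counter


def rare_tokens(corpus, min_count=5):
    """Map each token seen fewer than min_count times to its count."""
    counts = Counter(token for sentence in corpus for token in sentence)
    return {tok: n for tok, n in counts.items() if n < min_count}


# Toy corpus: '100025' appears 5 times; '100018' and '100040' appear fewer.
toy = [["100018", "100025"]] * 2 + [["100025", "100040"]] * 3

print(rare_tokens(toy))  # {'100018': 2, '100040': 3}
```

Any token this reports would be left out of the vocabulary under the default setting. If you need vectors for such tokens anyway, you can pass a lower `min_count` (for example `min_count=1`) when constructing `Word2Vec`, with the caveat that those vectors are likely to be poor.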

Separately, the site you're using example code from may not be a quality source of example code. It has changed a number of default values for no good reason, such as setting alpha and min_alpha to peculiar non-standard values with no comment explaining why. This is usually a sign that someone who doesn't know what they're doing is copying the odd choices of someone else who didn't know what they were doing either.

It was the word occurrence in the training set. Thank you for taking the time to answer this. I was able to figure this one out by following a similar thread: https://stackoverflow.com/questions/58666699/word2vec-keyerror-word-x-not-in-vocabulary?noredirect=1&lq=1 I think you provided the accepted answer there too. Thanks! — Neo, Jun 07 '22 at 18:23
