
I am training a doc2vec gensim model with a text file 'full_texts.txt' that contains ~1600 documents. Once the model is trained, I wish to use similarity methods over words and sentences.

However, since this is my first time using gensim, I am unable to find a solution. When I look for similarity by words as shown below, I get an error that the word doesn't exist in the vocabulary. My other question is: how do I check similarity for entire documents? I have read a lot of questions around this, like this one, and looked up the documentation, but I am still not sure what I am doing wrong.

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedLineDocument

# one tagged document per line of the file, tagged with its line number
tagdocs = TaggedLineDocument('full_texts.txt')
d2v_mod = Doc2Vec(min_count=3, vector_size=200, workers=2, window=5,
                  epochs=30, dm=0, dbow_words=1, seed=42)
d2v_mod.build_vocab(tagdocs)
d2v_mod.train(tagdocs, total_examples=d2v_mod.corpus_count, epochs=20)

d2v_mod.wv.similar_by_word('overdraft',topn=10)
KeyError: "word 'overdraft' not in vocabulary"
Shoaibkhanz

1 Answer


Are you sure 'overdraft' appears at least min_count=3 times in your corpus? (For example, what does grep -c " overdraft " full_texts.txt return?)
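If `grep` isn't handy, the same frequency check can be done in Python. A minimal sketch, assuming 'full_texts.txt' is whitespace-tokenized the way `TaggedLineDocument`'s simple line-split sees it:

```python
from collections import Counter

def token_counts(path):
    """Count whitespace-delimited tokens across all lines of the file,
    approximating the tokens TaggedLineDocument would feed the model."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            counts.update(line.split())
    return counts

# counts = token_counts("full_texts.txt")
# counts["overdraft"] must be >= min_count (here 3) to survive vocab pruning
```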

(Note also that 1600 docs is a very small corpus for Doc2Vec purposes; published work typically uses at least tens of thousands of docs, and often millions.)

In general, if you're concerned about getting basic functionality working, good ideas are to:

  • follow trustworthy examples - the gensim docs/notebooks directory includes several Jupyter/IPython notebooks demonstrating doc2vec functionality, including the minimal intro doc2vec-lee.ipynb, also viewable online (but it's best to run locally so you can tinker with specifics to learn)

  • enable logging at the INFO level, and watch the output closely to make sure the various reported progress steps, including counts of words/docs and training durations, indicate everything is working sensibly

  • probe the resulting model for expected behavior. For example, is an expected word present in the learned vocabulary? `'overdrafts' in d2v_mod.wv`. How many document tags were learned? `len(d2v_mod.docvecs)`. Etc.
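On the tag-count probe specifically: `TaggedLineDocument` assigns one integer tag per line of the file, so the number of learned doc-vectors should match the file's line count. A stdlib sketch of that sanity check (the trained-model assertion is left commented out, since it needs the model in scope):

```python
def line_count(path):
    """Number of lines = number of integer tags TaggedLineDocument creates."""
    with open(path, encoding="utf-8") as fh:
        return sum(1 for _ in fh)

# n = line_count("full_texts.txt")   # expect ~1600 for this corpus
# assert len(d2v_mod.docvecs) == n   # tags run 0 .. n-1
```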

gojomo
  • Thanks for your response @gojomo, and thanks for your feedback here and advice on various github issues as well. Considering your advice I tried a check, i.e. `min_count=1`, and looked through `d2v_mod.wv.vocab.keys()`: I can see many terms that appear there, but they don't come up when I try e.g. `'overdrafts' in d2v_mod.wv`, i.e. I get `False` as the output. – Shoaibkhanz Apr 28 '19 at 11:57
  • The ultimate aim is to create a pipeline for binary classification, with Doc2Vec and LDA/TF-IDF as inputs. Further, I will get a larger corpus to train on. The problem is to classify complaints and queries; I should be able to get ~100k to ~200k documents. – Shoaibkhanz Apr 28 '19 at 12:03
  • Does the acceptance of the answer mean you're no longer having the problem, perhaps because re-training with `min_count=1` solved the issue? – gojomo Apr 28 '19 at 16:12
  • I just unchecked the acceptance of this answer, & to answer your question, I'll try to print some output; when I use the following code `print([[(dt_id2word[id], freq) for id, freq in cp] for cp in dt_corpus[6:7]])`, the result is the following `[[('second', 1), ('back', 1), ('online', 1), ('open', 1), ('website', 1), ('different', 1), ('due', 1), ('form', 1), ('form_complete', 1), ('locate', 1), ('near_branch', 1), ('new', 1), ('process', 1), ('return', 1), ('shocking', 1), ('switch', 1), ('try', 2), ('twice', 1)]]` I know `try` occurred twice, but still `'try' in d2v_mod.wv` returns `False` – Shoaibkhanz Apr 28 '19 at 17:40
  • With `min_count=3`, a word that only appears twice in the corpus will be dropped. Did you rebuild the model with `min_count=1`? Staying focused on your original example of a problem, that `'overdrafts'` was missing where expected, what does `grep -c " overdraft " full_texts.txt` return? – gojomo Apr 28 '19 at 17:55
  • I got 163 returned after running the following: `grep -c "overdraft" full_texts.txt` – Shoaibkhanz Apr 28 '19 at 18:11
  • It'd be better to grep exactly as suggested, **with** leading & trailing spaces, to get exactly a count of space-delimited `'overdraft'` occurrences (& not count other words like `'overdrafts'` or `'overdrafted'` etc). But in any case, when you then train a `d2v_mod` with `TaggedLineDocument('full_texts.txt')`, what do `'overdraft' in d2v_mod.wv` and `d2v_mod.wv.vocab['overdraft'].count` return? If `True` & some number, you shouldn't get the `KeyError` anymore. – gojomo Apr 29 '19 at 17:55
  • OTOH, if you get other results for those tests & still get the `KeyError`, then comb over the INFO-level logs for hints something went wrong in the training. For example, are you sure you're using the right `full_texts.txt`? etc – gojomo Apr 29 '19 at 17:55