
I am training a doc2vec gensim model with a text file 'full_texts.txt' that contains ~1600 documents. Once the model is trained, I wish to use similarity methods over words and sentences.

However, since this is my first time using gensim, I am unable to find a solution. When I look for similarity by words as shown below, I get an error that the word doesn't exist in the vocabulary. My other question is: how do I check similarity for entire documents? I have read a lot of questions around this, like this one, and looked up the documentation, but I am still not sure what I am doing wrong.

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedLineDocument

# one tagged document per line of the file, tagged with its line number
tagdocs = TaggedLineDocument('full_texts.txt')
d2v_mod = Doc2Vec(min_count=3, vector_size=200, workers=2, window=5,
                  epochs=30, dm=0, dbow_words=1, seed=42)
d2v_mod.build_vocab(tagdocs)
d2v_mod.train(tagdocs, total_examples=d2v_mod.corpus_count, epochs=20)

d2v_mod.wv.similar_by_word('overdraft',topn=10)
KeyError: "word 'overdraft' not in vocabulary"
Shoaibkhanz

1 Answer


Are you sure 'overdraft' appears at least min_count=3 times in your corpus? (For example, what does grep -c " overdraft " full_texts.txt return?)
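If `grep` isn't handy, the same frequency check can be done in Python. A minimal sketch, assuming 'full_texts.txt' is whitespace-tokenized the way `TaggedLineDocument`'s simple line-split sees it:

```python
from collections import Counter

def token_counts(path):
    """Count whitespace-delimited tokens across all lines of the file,
    approximating the tokens TaggedLineDocument would feed the model."""
    counts = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            counts.update(line.split())
    return counts

# counts = token_counts("full_texts.txt")
# counts["overdraft"] must be >= min_count (here 3) to survive vocab pruning
```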

(Note also that 1600 docs is a very small corpus for Doc2Vec purposes; published work typically uses at least tens of thousands of docs, and often millions.)

In general, if you're concerned about getting basic functionality working, good ideas are to:

  • follow trustworthy examples - the gensim docs/notebooks directory includes several Jupyter/IPython notebooks demonstrating doc2vec functionality, including the minimal intro doc2vec-lee.ipynb, also viewable online (but it's best to run locally so you can tinker with specifics to learn)

  • enable logging at the INFO level, and watch the output closely to make sure the various reported progress steps, including counts of words/docs and training durations, indicate everything is working sensibly

  • probe the resulting model for expected behavior. For example, is an expected word present in the learned vocabulary? `'overdrafts' in d2v_mod.wv`. How many document tags were learned? `len(d2v_mod.docvecs)`. Etc.
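On the tag-count probe specifically: `TaggedLineDocument` assigns one integer tag per line of the file, so the number of learned doc-vectors should match the file's line count. A stdlib sketch of that sanity check (the trained-model assertion is left commented out, since it needs the model in scope):

```python
def line_count(path):
    """Number of lines = number of integer tags TaggedLineDocument creates."""
    with open(path, encoding="utf-8") as fh:
        return sum(1 for _ in fh)

# n = line_count("full_texts.txt")   # expect ~1600 for this corpus
# assert len(d2v_mod.docvecs) == n   # tags run 0 .. n-1
```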

gojomo
  • Thanks for your response @gojomo, and thanks for your feedback here and advice on various github issues as well. Considering your advice I tried a check, i.e. `min_count=1`, and looked through `d2v_mod.wv.vocab.keys()`: I can see many terms that appear there, but they don't come up when I try e.g. `'overdrafts' in d2v_mod.wv`, i.e. I get `False` as the output. – Shoaibkhanz Apr 28 '19 at 11:57
  • The ultimate aim is to create a pipeline for binary classification, with Doc2Vec and LDA/TF-IDF as inputs. Further, I will get a larger corpus to train on. The problem is to classify complaints and queries; I should be able to get ~100k to ~200k documents. – Shoaibkhanz Apr 28 '19 at 12:03
  • Does the acceptance of the answer mean you're no longer having the problem, perhaps because re-training with `min_count=1` solved the issue? – gojomo Apr 28 '19 at 16:12
  • I just unchecked the acceptance of this answer, & to answer your question, I'll try to print some output; when I use the following code `print([[(dt_id2word[id], freq) for id, freq in cp] for cp in dt_corpus[6:7]])`, the result is the following `[[('second', 1), ('back', 1), ('online', 1), ('open', 1), ('website', 1), ('different', 1), ('due', 1), ('form', 1), ('form_complete', 1), ('locate', 1), ('near_branch', 1), ('new', 1), ('process', 1), ('return', 1), ('shocking', 1), ('switch', 1), ('try', 2), ('twice', 1)]]` I know `try` occurred twice, but still `'try' in d2v_mod.wv` returns `False` – Shoaibkhanz Apr 28 '19 at 17:40
  • With `min_count=3`, a word that only appears twice in the corpus will be dropped. Did you rebuild the model with `min_count=1`? Staying focused on your original example of a problem, that `'overdrafts'` was missing where expected, what does `grep -c " overdraft " full_texts.txt` return? – gojomo Apr 28 '19 at 17:55
  • I got 163 returned after running the following: `grep -c "overdraft" full_texts.txt` – Shoaibkhanz Apr 28 '19 at 18:11
  • It'd be better to grep exactly as suggested, **with** leading & trailing spaces, to get exactly a count of space-delimited `'overdraft'` occurrences (& not count other words like `'overdrafts'` or `'overdrafted'` etc). But in any case, when you then train a `d2v_mod` with `TaggedLineDocument('full_texts.txt')`, what do `'overdraft' in d2v_mod.wv` and `d2v_mod.wv.vocab['overdraft'].count` return? If `True` & some number, you shouldn't get the `KeyError` anymore. – gojomo Apr 29 '19 at 17:55
  • OTOH, if you get other results for those tests & still get the `KeyError`, then comb over the INFO-level logs for hints something went wrong in the training. For example, are you sure you're using the right `full_texts.txt`? etc – gojomo Apr 29 '19 at 17:55