Does this mean that I must provide the tokenized words of a document as a list of strings, or simply the document as a list of strings, for the `doc_words` input? Please clarify.
- Does this answer your question? [How to use the infer\_vector in gensim.doc2vec?](https://stackoverflow.com/questions/44993240/how-to-use-the-infer-vector-in-gensim-doc2vec) – WiLL_K Jan 07 '20 at 15:26
- @WiLL_K Yes, the question you quoted answers my doubt. Thank you so much; your clarification improved my model's performance from 36% to 79%. I was feeding the whole document as input, but I now understand that we must feed the tokens of a document. Keep up your work. – shrikanth singh Jan 07 '20 at 16:58
1 Answer
The `doc_words` should be a list of individual word-tokens, as strings, equivalent to the `words` of each training document during training. That is: it should have been preprocessed and tokenized the same way your training data was.

(When you ask in your question, "tokenized words of a document as list of strings or simply a document as a list of string", as far as I understand those words, those two alternatives are the same thing: a Python `list`, where each item is a string word.)
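
For example, here is a minimal sketch of the difference; the toy corpus, the `preprocess` helper, and the parameter values are illustrative assumptions, not part of the original question:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def preprocess(text):
    # Whatever preprocessing/tokenization you use, apply it
    # identically at training time and at inference time.
    return text.lower().split()

# Hypothetical toy corpus; in practice, use your real documents.
raw_docs = ["The quick brown fox", "jumps over the lazy dog"]
train_corpus = [TaggedDocument(words=preprocess(doc), tags=[i])
                for i, doc in enumerate(raw_docs)]
model = Doc2Vec(train_corpus, vector_size=50, min_count=1, epochs=40)

# Correct: doc_words is a list of word-token strings.
vector = model.infer_vector(preprocess("a quick brown dog"))

# Wrong: passing the whole document as a single string; each
# character would be treated as a separate "word" and mostly ignored.
# vector = model.infer_vector("a quick brown dog")
```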
Other important things to note about `infer_vector()`:
- inference always starts with a low-magnitude random vector, then iteratively improves that vector
- words not known to the model will be silently ignored; at the extreme, if you supply a text of all unknown words, no inference will happen, but because of the random initialization above, you'll still get a vector back
- if you don't specify an `epochs` value, it will reuse the value cached in the model (left over from model initialization or your last `train()` call). You will generally want it to use a number of epochs at least as large as was used in training, which is most commonly 10-20, but sometimes larger. (Larger values may be especially helpful with shorter texts; see the sketch below.)
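
A short sketch of these points, assuming the trained `model` from the earlier example and illustrative token/epoch values:

```python
tokens = ["quick", "brown", "fox"]

# Pass epochs explicitly rather than relying on the model's cached value.
vec_a = model.infer_vector(tokens, epochs=50)
vec_b = model.infer_vector(tokens, epochs=50)

# Because inference starts from a low-magnitude random vector, repeated
# calls on the same tokens return similar but not identical results.
```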
