I have been tasked with putting a document vector model into production. I am an R user, and so my original model is in R. One of the avenues we have is to recreate the code and the models in Python.
I am confused by the Gensim implementation of Doc2vec.
The process that works in R goes like this:
Offline
- Word vectors are trained using the functions in the text2vec package, namely GloVe or GlobalVectors, on a large corpus. This gives me a large Word Vector text file.
- Before the ML step takes place, the Doc2Vec function from the TextTinyR library is used to turn each piece of text from a smaller, more specific training corpus into a vector. This is not a machine learning step. No model is trained. The Doc2Vec function effectively aggregates the word vectors in the sentence, in the same sense that finding the sum or mean of vectors does, but in a way that preserves information about word order.
- Various models are then trained on these smaller text corpuses.
Online
- The new text is converted to Document Vectors using the pretrained word vectors.
- The Document Vectors are fed into the pretrained model to obtain the output classification.
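To make the process above concrete, here is a rough Python sketch of the aggregation step as I understand it. It is only an illustration: it uses a plain mean of the word vectors, which does not preserve word order the way the R Doc2Vec function does, and the function names, the toy vector values, and the whitespace tokenizer are all made up for this example.

```python
from io import StringIO


def load_word_vectors(handle):
    """Parse a GloVe-style text file: one word per line, followed by its floats."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def document_vector(text, vectors, dim):
    """Average the vectors of the known words in `text`.

    Unknown words are skipped; a text with no known words maps to zeros.
    No model is trained here, it is pure aggregation.
    """
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Toy stand-in for the large word vector text file (values are invented).
toy_file = StringIO("cat 1.0 0.0\ndog 0.0 1.0\nfish 1.0 1.0\n")
word_vectors = load_word_vectors(toy_file)

# "unicorn" is not in the vocabulary, so only "cat" and "dog" contribute.
doc_vec = document_vector("Cat dog unicorn", word_vectors, dim=2)
print(doc_vec)
```

The resulting vector is what would then be passed to the pretrained classifier in the online step.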
The example code I have found for Gensim appears to be a radical departure from this.
In gensim, it appears that doc vectors are a separate class of model from word vectors, and one that you train. In some cases, it seems the word vectors and doc vectors are all trained at once. Here are some examples from tutorials and Stack Overflow answers:
https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5
How to use Gensim doc2vec with pre-trained word vectors?
How to load pre-trained model with in gensim and train doc2vec with it?
gensim(1.0.1) Doc2Vec with google pretrained vectors
So my questions are these:
Is the gensim implementation of Doc2Vec fundamentally different from the TextTinyR implementation?
Or is the gensim doc2vec model basically just encapsulating the word2vec model and the doc2vec process into a single object?
Is there anything else I'm missing about the process?