
I have been tasked with putting a document vector model into production. I am an R user, so my original model is in R. One avenue we are considering is to recreate the code and the models in Python.

I am confused by the Gensim implementation of Doc2Vec.

The process that works in R goes like this:

Offline


  • Word vectors are trained using the functions in the text2vec package, namely GlobalVectors (its GloVe implementation), on a large corpus. This gives me a large word-vector text file.

  • Before the ML step takes place, the Doc2Vec function from the textTinyR package is used to turn each piece of text from a smaller, more specific training corpus into a vector. This is not a machine learning step; no model is trained. The Doc2Vec function effectively aggregates the word vectors in the text, in the same sense that taking the sum or mean of the vectors does, but in a way that preserves information about word order.

  • Various models are then trained on these smaller text corpora.


Online


  • The new text is converted to Document Vectors using the pretrained word vectors.
  • The Document Vectors are fed into the pretrained model to obtain the output classification.
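
For reference, here is a minimal Python sketch of this two-step pipeline, assuming pretrained vectors saved in word2vec text format (word_vectors.txt is a hypothetical path; raw GloVe output has no header line, in which case pass no_header=True) and using a plain mean of word vectors as a stand-in for textTinyR's aggregation:

```python
import numpy as np
from gensim.models import KeyedVectors
from sklearn.linear_model import LogisticRegression

# Offline: load the large pretrained word-vector text file.
wv = KeyedVectors.load_word2vec_format("word_vectors.txt", binary=False)

def doc_vector(text):
    """Aggregate a text's word vectors into one document vector (mean)."""
    vecs = [wv[w] for w in text.split() if w in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

# Offline: vectorize the smaller labelled corpus and train a model on it.
train_texts = ["good product works well", "terrible broke after a day"]
train_labels = [1, 0]
X = np.vstack([doc_vector(t) for t in train_texts])
clf = LogisticRegression().fit(X, train_labels)

# Online: convert new text with the same function, then classify it.
prediction = clf.predict(doc_vector("works really well").reshape(1, -1))
```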

The example code I have found for Gensim appears to be a radical departure from this.

It appears that in gensim, doc vectors are a separate class of model from word vectors, one that you train in its own right. In some cases, the word vectors and doc vectors are even trained at once. Here are some examples from tutorials and Stack Overflow answers:

https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5

How to use Gensim doc2vec with pre-trained word vectors?

How to load pre-trained model with in gensim and train doc2vec with it?

gensim(1.0.1) Doc2Vec with google pretrained vectors

So my questions are these:

Is the gensim implementation of Doc2Vec fundamentally different from the textTinyR implementation?

Or is the gensim doc2vec model basically just encapsulating the word2vec model and the doc2vec process into a single object?

Is there anything else I'm missing about the process?

Ingolifs

3 Answers


In R, you can use text2vec (https://cran.r-project.org/package=text2vec) to train GloVe embeddings, word2vec (https://cran.r-project.org/package=word2vec) to train word2vec embeddings, or fastText (https://cran.r-project.org/package=fastText / https://cran.r-project.org/package=fastTextR) to train fasttext embeddings. You can aggregate these embeddings to the document level by simply taking e.g. the average of the word vectors, or of only the relevant nouns/adjectives if you tag the text using udpipe (https://cran.r-project.org/package=udpipe). Alternatively, use the approach from the R package textTinyR (https://cran.r-project.org/package=textTinyR), which provides three other aggregation options: sum_sqrt / min_max_norm / idf.
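
Since the goal is a Python port, here is a hedged sketch of what one of those aggregations, the idf-weighted average, could look like; sum_sqrt and min_max_norm follow their own formulas in the textTinyR manual, so this is illustrative rather than a reimplementation:

```python
import numpy as np

def idf_weighted_doc_vector(tokens, wv, idf, dim):
    """Average the word vectors of `tokens`, weighting each word by its
    inverse document frequency so that rare words count for more.
    Illustrative only; the exact textTinyR formulas may differ."""
    pairs = [(wv[t], idf.get(t, 1.0)) for t in tokens if t in wv]
    if not pairs:
        return np.zeros(dim)
    vecs, weights = zip(*pairs)
    return np.average(np.vstack(vecs), axis=0, weights=np.array(weights))

# Toy usage with a dict standing in for real pretrained vectors.
wv = {"cat": np.array([1.0, 0.0]), "sat": np.array([0.0, 1.0])}
idf = {"cat": 2.0, "sat": 0.5}
print(idf_weighted_doc_vector(["the", "cat", "sat"], wv, idf, dim=2))
```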

The R package doc2vec (https://cran.r-project.org/package=doc2vec) allows you to train paragraph-vector embeddings (PV-DBOW / PV-DM in Gensim terminology), which are not just averages of word vectors but are learned by a specific model (e.g. see https://www.bnosac.be/index.php/blog/103-doc2vec-in-r). ruimtehol (https://cran.r-project.org/package=ruimtehol) lets you train StarSpace embeddings, with the option of training sentence embeddings as well.

  • Thanks for alerting me to the existence of this doc2vec package. It looks to be quite useful and much more in line with the gensim implementation. – Ingolifs Jun 23 '21 at 02:31

I have no idea what the textTinyR package's Doc2Vec function that you've mentioned is doing; Google searches turn up no documentation of its functionality. But if it's instant, and it requires word-vectors as an input, perhaps it's just averaging all the word-vectors for the text's words together.

You can read all about Gensim's Doc2Vec model in the Gensim documentation:

https://radimrehurek.com/gensim/models/doc2vec.html

As its intro explains:

Learn paragraph and document embeddings via the distributed memory and distributed bag of words models from Quoc Le and Tomas Mikolov: “Distributed Representations of Sentences and Documents”.

The algorithm that Gensim's Doc2Vec implements is also commonly called 'Paragraph Vector' by its authors, including in the followup paper by Dai et al., "Document Embedding with Paragraph Vectors".

'Paragraph Vector' uses a word2vec-like training process to learn text-vectors for paragraphs (or other texts of many words). This process does not require prior word-vectors as an input, but many modes will co-train word-vectors along with the doc-vectors. It does require training on a set of documents, but after training the .infer_vector() method can be used to train-up vectors for new texts, not in the original training set, to the extent they use the same words. (Any new words in such post-model-training documents will be ignored.)
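
A minimal sketch of that workflow, assuming the gensim 4.x API and a toy corpus:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "machine learning models need labelled training data",
]
# Each training text is wrapped in a TaggedDocument with a unique tag.
tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(corpus)]

# dm=1 selects PV-DM; dm=0 selects PV-DBOW.
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=1,
                epochs=40, dm=1)

# Doc-vectors for the training documents are stored in the model...
train_vec = model.dv[0]
# ...while vectors for unseen texts are inferred after training.
new_vec = model.infer_vector("a brand new document".split())
```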

You might be able to approximate your R function with something simple like an average-of-word-vectors.

Or, you could try the alternate Doc2Vec in Gensim.

But, the Gensim Doc2Vec is definitely something different, and it's unfortunate the two libraries use the same Doc2Vec name for different processes.

gojomo
  • Check out page 15 of https://cran.r-project.org/web/packages/textTinyR/textTinyR.pdf for textTinyR's doc2vec functionality. – Ingolifs Jun 17 '21 at 22:34
  • So doing some tests on the `textTinyR` doc2vec function, on a sentence, then on the same sentence reversed and on it scrambled, all the doc2vec outputs are identical! I've used this package for a long time without noticing and I'm incredulous. – Ingolifs Jun 17 '21 at 22:58
  • That doc's description of the possible 'methods' seems somewhat idiosyncratic: I don't immediately recognize the calculation as similar to other common techniques with established names, but it definitely seems (1) calculated via some combination of the input word-vectors; and (2) totally unlike Gensim's `Doc2Vec`. – gojomo Jun 18 '21 at 00:54
  • That the calculation is oblivious to word order is understandable: a simple average of word-vectors has that quality, and the description in the doc you linked doesn't talk about word neighbors or ordering, just the whole token list. (If it's oblivious to *character*-scrambling, something else is wrong!) – gojomo Jun 18 '21 at 00:56

I guess you are already aware of the Doc2Vec function documentation in the textTinyR package. What I'd like to add, just for the record, is that I'm the author of the textTinyR package.

lampros