
I am a computer science student, and this is my first question on Stack Overflow. I would really appreciate your help! (The package I am referring to is called 'word2vec', which is why the tags and title were a bit confusing to choose.)

In the description of the doc2vec function (here https://cran.r-project.org/web/packages/word2vec/word2vec.pdf) it says:

Document vectors are the sum of the vectors of the words which are part of the document standardised by the scale of the vector space. This scale is the sqrt of the average inner product of the vector elements.

From what I understood, doc2vec trains one additional vector for every paragraph, which, in my eyes, seems to be different from the above description.

Is my understanding of doc2vec correct, or close enough? And: Does the cited implementation work like the doc2vec-algorithm?

Jeremy Caney
  • Please read [(1)](https://stackoverflow.com/help/how-to-ask) how do I ask a good question, [(2)](https://stackoverflow.com/help/mcve) how to create a MCVE as well as [(3)](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#answer-5963610) how to provide a minimal reproducible example in R. Then edit and improve your question accordingly. I.e., abstract from your real problem... – Christoph Nov 10 '20 at 17:14

1 Answer


Many people use "Doc2Vec" to refer to the word2vec-like algorithm introduced by a paper titled Distributed Representations of Sentences and Documents (by Le & Mikolov). That paper calls the algorithm 'Paragraph Vector', without using the name 'Doc2Vec', and indeed introduces an extra vector per document, like you describe. (That is, the doc-vector is trained a bit like a 'floating' pseudoword-vector that contributes to the input 'context' for every training prediction in that document.)
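To make the 'floating pseudoword' idea concrete, here is a minimal sketch of how the input for a single training prediction could be formed in the averaging variant of the paper's PV-DM mode. The function name, the toy vectors, and the document ids are hypothetical, and the actual training step (predicting the target word and backpropagating into both kinds of vector) is left out:

```python
def pv_dm_input(doc_id, context_words, doc_vectors, word_vectors):
    """Average the document's own vector with its context word vectors.

    Only the input-forming step is sketched; in real training both
    doc_vectors and word_vectors would be updated by backpropagation.
    """
    parts = [doc_vectors[doc_id]] + [word_vectors[w] for w in context_words]
    dim = len(parts[0])
    return [sum(p[i] for p in parts) / len(parts) for i in range(dim)]


# Hypothetical toy values, for illustration only.
toy_doc_vectors = {"doc0": [0.5, -0.5], "doc1": [-0.5, 0.5]}
toy_word_vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}

h0 = pv_dm_input("doc0", ["the", "cat"], toy_doc_vectors, toy_word_vectors)
h1 = pv_dm_input("doc1", ["the", "cat"], toy_doc_vectors, toy_word_vectors)
```

The same context words produce a different input in each document, because each document contributes its own trainable vector.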

I'm not familiar with R or that R word2vec package, but from the docs you linked, it does not sound like that doc2vec function implements the 'Paragraph Vector' algorithm that others call 'Doc2Vec'. In particular:

  • 'Paragraph Vector' doc-vectors are not a simple sum-of-word-vectors

  • 'Paragraph Vector' doc-vectors are created by a separate word2vec-like training process that co-creates any necessary word-vectors simultaneously with that training. Specifically: that process does not normally use other pre-trained word-vectors as input, nor create word-vectors as a first step. (And further: the PV-DBOW option of the 'Paragraph Vector' paper doesn't create traditional word-vectors at all.)

It appears that function is poorly-named, and if you need to use the actual 'Paragraph Vector' algorithm, you will need to look elsewhere.
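For contrast, the sum-of-word-vectors approach that the package description quotes could be sketched roughly as follows. This is my reading of the docs, not the package's actual code: I am assuming the 'scale' is the square root of the mean squared element of the summed vector, and the function name and toy vectors are hypothetical:

```python
import math


def sum_of_words_doc_vector(tokens, word_vectors):
    """Sum pre-trained word vectors, then divide by an assumed scale.

    Assumption: scale = sqrt(mean of squared elements of the sum).
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no known words in document")
    dim = len(known[0])
    summed = [sum(vec[i] for vec in known) for i in range(dim)]
    scale = math.sqrt(sum(x * x for x in summed) / dim)
    if scale == 0.0:
        raise ValueError("summed vector is all zeros")
    return [x / scale for x in summed]


# Hypothetical toy vectors; unknown words such as "the" are skipped.
toy_vectors = {"cat": [1.0, 0.0], "sat": [0.0, 1.0], "mat": [1.0, 1.0]}
doc_vec = sum_of_words_doc_vector(["the", "cat", "sat"], toy_vectors)
```

Nothing here is trained per document: the result is a fixed function of the pre-trained word vectors and ignores word order, which is the key difference from 'Paragraph Vector'.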

gojomo
  • In case someone else was wondering: today the author of the C++ library (the R package is just a wrapper) said on the GitHub page that there is not going to be a 'paragraph vector' implementation. So this is another indicator that the 'doc2vec' function is not what you are expecting (or at least not what I was expecting). Reference: https://github.com/maxoodf/word2vec/issues/12 – Frederic Klein Nov 12 '20 at 13:56
  • If someone is still searching for a solution: consider https://github.com/bnosac/doc2vec. It is written by the same author (wrapping a different C++ package). – Frederic Klein Nov 23 '20 at 07:56