I'm training a Word2Vec model like:

from gensim.models import Word2Vec

model = Word2Vec(documents, size=200, window=5, min_count=0, workers=4, iter=5, sg=1)

and Doc2Vec model like:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

doc2vec_model = Doc2Vec(size=200, window=5, min_count=0, iter=5, workers=4, dm=1)
doc2vec_model.build_vocab(doc2vec_tagged_documents)
doc2vec_model.train(doc2vec_tagged_documents, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.iter)

with the same data and comparable parameters.

After this I'm using these models for my classification task. I have found that simply averaging or summing the word2vec embeddings of a document performs considerably better than using the doc2vec vectors. I also tried many more doc2vec iterations (25, 80 and 150); it makes no difference.
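
For reference, the averaging baseline looks like this (a minimal sketch, assuming the trained model and tokenized documents from the snippets above, and the pre-4.0 gensim API used there):

import numpy as np

def average_vector(model, tokens):
    # Average the vectors of all in-vocabulary tokens into one document vector.
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

doc_vectors = [average_vector(model, document) for document in documents]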

Any tips or ideas why and how to improve doc2vec results?

Update: This is how doc2vec_tagged_documents is created:

doc2vec_tagged_documents = [
    TaggedDocument(document, [i]) for i, document in enumerate(documents)
]

Some more facts about my data:

  • My training data contains 4000 documents, with 900 words on average.
  • My vocabulary size is about 1000 words.
  • My data for the classification task is much smaller (12 words on average), but I also tried splitting the training data into lines and training the doc2vec model that way; the result is almost the same.
  • My data is not about natural language, please keep this in mind.
ScientiaEtVeritas

1 Answer

Summing/averaging word2vec vectors is often quite good!

It is more typical to use 10 or 20 iterations with Doc2Vec, rather than the default 5 inherited from Word2Vec. (I see you've tried that, though.)

If your main interest is the doc-vectors – and not the word-vectors that some Doc2Vec modes co-train – definitely try the PV-DBOW mode (dm=0) as well. It'll train faster and is often a top performer.
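
For example (a hedged sketch with illustrative, untuned values: the question's corpus, switched to PV-DBOW, with more iterations per the note above and a min_count above the question's 0, as discussed in the last bullet below):

from gensim.models.doc2vec import Doc2Vec

dbow_model = Doc2Vec(size=200, window=5, min_count=2, iter=20, workers=4, dm=0)
dbow_model.build_vocab(doc2vec_tagged_documents)
dbow_model.train(doc2vec_tagged_documents,
                 total_examples=dbow_model.corpus_count,
                 epochs=dbow_model.iter)

Note that plain PV-DBOW doesn't train word-vectors at all (and ignores window); add dbow_words=1 if you also want skip-gram word-vectors trained alongside the doc-vectors.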

If your corpus is very small, or the docs very short, it may be hard for the doc-vectors to become generally meaningful. (In some cases, decreasing the vector size may help.) But especially if window is a large proportion of the average doc size, what's learned by the word-vectors and what's learned by the doc-vectors will be very, very similar. And since the words may get trained more times, in more diverse contexts, they may have more generalizable meaning – unless you have a larger collection of longer docs.

Other things that sometimes help improve Doc2Vec vectors for classification purposes:

  • re-inferring all document vectors at the end of training, perhaps even using parameters different from the infer_vector() defaults, such as infer_vector(tokens, steps=50, alpha=0.025) – while quite slow, this means all docs get vectors from the same final model state, rather than whatever was left over from bulk training (see the sketch after this list)

  • where classification labels are known, adding them as trained doc-tags, using the capability of TaggedDocument tags to be a list of tags (also shown in the sketch after this list)

  • rare words are essentially just noise to Word2Vec or Doc2Vec – so a min_count above 1, perhaps significantly higher, often helps. (Singleton words mixed in may be especially damaging to individual doc-ID doc-vectors that are also, by design, singletons. The training process is, in competition with the doc-vector, also trying to make those singleton word-vectors predictive of their single-document neighborhoods... when really, for your purposes, you just want the doc-vector to be most descriptive. So this suggests both trying PV-DBOW and increasing min_count.)
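
A hedged sketch of the first two bullets, reusing the question's model and corpus (the steps/alpha values are illustrative, and labels is a hypothetical list of known class labels, one per document):

from gensim.models.doc2vec import TaggedDocument

# Re-infer every doc-vector from the same final model state:
reinferred_vectors = [
    doc2vec_model.infer_vector(doc.words, steps=50, alpha=0.025)
    for doc in doc2vec_tagged_documents
]

# Add known class labels as extra doc-tags for a subsequent training run:
labeled_documents = [
    TaggedDocument(document, [i, labels[i]])
    for i, document in enumerate(documents)
]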

Hope this helps.

gojomo
  • What is the advantage of doc2vec over averaging word vectors? Does doc2vec account for a word's surroundings in the sentence while building the vector from the test sentence? Because that's one place where the word2vec doesn't help. – John Strood Oct 04 '18 at 12:44
  • 1
    Whether `Doc2Vec` works better than just averaging word vectors can depend on your corpus & goals. It's using similar inputs (word co-occurrences within context-windows or documents), & a similarly-sized predictive model (that generates the word or doc vectors), & similarly-sized text-representations (same number of dimensions), so scores on evaluations are likely to be in the same ballpark. In making each doc-vector predictive of all the words in each text, a doc-vector *might* model the text better than averages based on all those words' other occurrences. – gojomo Oct 04 '18 at 23:18
  • But it can make sense to try/tune both, especially since the simple-average (or some sort of weighted-average) can be so easy to calculate as a baseline. – gojomo Oct 04 '18 at 23:18