12

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb

I modified the code in line 10 to determine the best-matching document for a given query, and every time I run it, I get a completely different result set. My new code in line 10 of the notebook is:

    inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
    sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
    rank = [docid for docid, sim in sims]
    print(rank)

Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The differences are stark and the results just do not seem to match.

Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?

Rohan
  • 665
  • 9
  • 17

2 Answers

16

Looking into the code: in infer_vector you are using parts of the algorithm that are non-deterministic. Initialization of the word vector is deterministic (see the code of seeded_vector below), but looking further, random sampling of words and negative sampling (updating only a sample of word vectors per iteration) can cause non-deterministic output (thanks @gojomo).

    def seeded_vector(self, seed_string):
        """Create one 'random' vector (but deterministic by seed_string)"""
        # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
        once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
        return (once.rand(self.vector_size) - 0.5) / self.vector_size
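Since some jitter across infer_vector calls is expected, a practical check is whether repeated inferences are "similar enough" rather than identical. Here is a minimal sketch of such a check in plain numpy, where v1 and v2 are hypothetical stand-ins for two infer_vector results on the same tokens (the jitter magnitude is an assumption for illustration):

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Stand-ins for two infer_vector() results on the same tokens:
# the second is the first plus small random 'jitter'.
rng = np.random.default_rng(42)
v1 = rng.standard_normal(50)
v2 = v1 + 0.05 * rng.standard_normal(50)

sim = cosine_similarity(v1, v2)
print(round(sim, 3))  # close to 1.0 when repeated inferences are stable
```

If the similarity between repeated inferences on the same document is far from 1.0, that is a sign the model or inference parameters need attention, as discussed in the comments below.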
l.augustyniak
  • 1,794
  • 1
  • 15
  • 15
  • but then this basically means that on every call to infer_vector, I get a different result for the same query. It is like searching Google for something and getting a completely different result every time. – Rohan Jan 21 '18 at 22:52
  • 3
    Yes, but they shouldn't be too different. If they are, your model may be underpowered/overfit, or your inference may not be using appropriate `steps` and starting `alpha`. It's common for many more `steps` and/or a lower starting `alpha`, more like the default training `alpha` of 0.025, to work better for inference than the defaults, especially on short docs. – gojomo Jan 22 '18 at 00:05
  • 2
    Note that it's *not* random start-vector initialization that causes varied results per run – the `seeded_vector()` function ensures identical starting vectors for the same `seed_string`s, & `Doc2Vec.infer_vector()` uses the tokens you're inferring as the `seed_string` in a deterministic way. Rather, it's other steps of the algorithm that inherently use random sampling of words, window sizes, or negative-examples. There are ways to force determinism in those steps, too, but that just hides the 'jitter' of the algorithm. It's better to ensure subsequent runs are 'similar enough' than 'artificially identical'. – gojomo Jan 22 '18 at 00:11
  • 2
    so increasing the steps to 500 and decreasing the alpha and min_alpha significantly led to convergence on one consistent result. However, the result was still way off and did not look similar at all. The library publishers do not provide any recommendation on when to use this. Probably it is not suitable for smaller text documents or smaller sets of documents. – Rohan Jan 22 '18 at 04:01
  • 4
    @gojomo thanks for pointing out my too fast conclusion. I looked through your posts on github (https://github.com/RaRe-Technologies/gensim/issues/447) related to `infer_vector` and I corrected my answer. – l.augustyniak Jan 22 '18 at 11:31
  • @gojomo thank you, it works. I have very small documents and get very unstable results with the defaults. However, with `steps=200` and `alpha=0.00025` I get almost the same results every time – Antoine Feb 26 '19 at 22:28
  • @Antoine An inference starting `alpha=0.00025` is 100x smaller than a default/typical value - more like a typical *final* tiny value. If it works, great, but you might get as-good-or-better results with a somewhat-higher starting `alpha`, and fewer (and thus faster) `steps`. (Note also there have been some inference fixes in recent gensim releases – especially v3.5.0 of July 2018 – so be sure to upgrade if you're using anything older, and re-evaluate what values work best for your needs after upgrading.) – gojomo Feb 27 '19 at 01:27
  • that makes sense, thanks. In practice, I tried greater alphas and less steps, but was still observing some instability. I'm working with gensim `3.2.0` though. I'll upgrade and tell you. Thanks for the tip – Antoine Feb 27 '19 at 08:16
  • 2
    same observations with the latest version – Antoine Mar 06 '19 at 16:23
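Building on the comment thread above: besides using more `steps` and a tuned `alpha`, another common way to tame the jitter is to infer several times and average the results. A minimal numpy sketch, where `noisy_infer` is a hypothetical stand-in for `model.infer_vector` on a fixed document (the "true" vector and noise level are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_infer():
    """Stub for model.infer_vector: a fixed 'true' vector plus random jitter."""
    true_vec = np.linspace(-1.0, 1.0, 20)
    return true_vec + 0.1 * rng.standard_normal(20)

def averaged_infer(infer, n=10):
    """Average n independent inferences to reduce run-to-run variance."""
    return np.mean([infer() for _ in range(n)], axis=0)

v = averaged_infer(noisy_infer, n=25)
print(v.shape)  # (20,)
```

Averaging n inferences shrinks the per-coordinate jitter by roughly a factor of sqrt(n), at the cost of n times the inference work.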
2

Set `negative=0` to avoid randomization:

    import numpy as np
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: each document is a list of single-character tokens
    documents = [list('asdf'), list('asfasf')]
    documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
    # negative=0 disables negative sampling, the main source of randomness here
    model = Doc2Vec(documents, vector_size=20, window=5, min_count=1,
                    negative=0, workers=6, epochs=10)

    a = list('test sample')
    b = list('testtesttest')
    for s in (a, b):
        v1 = model.infer_vector(s)
        for i in range(100):
            # Repeated inference should now yield identical vectors
            v2 = model.infer_vector(s)
            assert np.all(v1 == v2), "Failed on %s" % ''.join(s)
James
  • 2,535
  • 1
  • 15
  • 14