
How does word2vec create vectors for words? I trained two word2vec models on two different files (from the Common Crawl website), but I am getting the same word vectors for a given word from both models.

Actually, I have created multiple word2vec models using different text files from the Common Crawl website, and now I want to check which of them is best. How can I select the best model, and why am I getting the same word vectors from different models?

Sorry if the question is not clear.

1 Answer


If you are getting identical word-vectors from models that you've prepared from different text corpuses, something is likely wrong in your process. You may not be performing any training at all, perhaps because of a problem in how the text iterable is provided to the Word2Vec class. (In that case, word-vectors would remain at their initial, randomly-initialized values.)

You should enable logging, and review the logs carefully to see that sensible counts of words, examples, progress, and incremental-progress are displayed during the process. You should also check that results for some superficial, ad-hoc checks look sensible after training. For example, does model.most_similar('hot') return other words/concepts somewhat like 'hot'?
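As a rough sketch of both checks: the snippet below turns on standard-library logging (gensim writes its vocabulary and training progress to it at INFO level) and then runs a quick neighbor check. `spot_check` is a hypothetical helper, not part of gensim; it only assumes it is given an object with a `most_similar(word, topn=...)` method that raises `KeyError` for unknown words, which is how gensim's word-vectors object behaves.

```python
import logging

# Standard-library logging; configure it before building or training the model
# so that gensim's INFO-level progress messages are shown.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)


def spot_check(word_vectors, probe_words, topn=5):
    """Return {word: [neighbor, ...]}; the value is None for unknown words.

    Hypothetical helper: `word_vectors` is assumed to expose
    most_similar(word, topn=...) returning (word, score) pairs.
    """
    report = {}
    for word in probe_words:
        try:
            neighbors = word_vectors.most_similar(word, topn=topn)
        except KeyError:
            # Out-of-vocabulary probe word: record it rather than crash.
            report[word] = None
            continue
        report[word] = [neighbor for neighbor, _score in neighbors]
    return report
```

If every probe word comes back as `None`, or the neighbors look random, that points to a vocabulary or training problem rather than a poor corpus.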

Once you're sure models are being trained on varied corpuses – in which case their word-vectors should be very different from each other – deciding which model is 'best' depends on your specific goals with word-vectors.
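One way to confirm that two models really differ is to count how many words have exactly the same vector in both. This is only a sketch: `same_vector` and `count_identical` are hypothetical helpers, and the two arguments are assumed to be word-to-vector mappings (plain dicts here; a gensim word-vectors object should behave similarly for `in` and `[]` lookups).

```python
import math


def same_vector(vec_a, vec_b, tol=1e-6):
    """True if two vectors have equal length and match element-wise within tol."""
    if len(vec_a) != len(vec_b):
        return False
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(vec_a, vec_b))


def count_identical(vectors_a, vectors_b, words):
    """Return (identical, shared): how many shared words have identical vectors."""
    shared = [w for w in words if w in vectors_a and w in vectors_b]
    identical = sum(same_vector(vectors_a[w], vectors_b[w]) for w in shared)
    return identical, shared and len(shared) or 0
```

Identical vectors across separately trained models are the warning sign. Beyond that, raw coordinates from different training runs are not directly comparable, since each run starts from its own random initialization, so a mismatch by itself says nothing about which model is better.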

You should devise a repeatable, quantitative way to evaluate a model against your intended end-uses. This might start crudely with a few of your own manual reviews of results, such as looking over most_similar() results for important words and judging them better or worse, but it should become more extensive, rigorous, and automated as your project progresses.

An example of such an automated scoring is the accuracy() method on gensim's word-vectors object. See:

https://github.com/RaRe-Technologies/gensim/blob/6d6f5dcfa3af4bc61c47dfdf5cdbd8e1364d0c3a/gensim/models/keyedvectors.py#L652

If supplied with a specifically-formatted file of word-analogies, it will check how well the word-vectors solve those analogies. For example, the questions-words.txt of Google's original word2vec code release includes the analogies they used to report vector quality. Note, though, that the word-vectors that are best for some purposes, like understanding text topics or sentiment, might not also be the best at solving this style of analogy, and vice-versa. If training your own word-vectors, it's best to choose your training corpus/parameters based on your own goal-specific criteria for what 'good' vectors will be.
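To make the idea concrete, here is a minimal, self-contained sketch of analogy scoring. It is not gensim's implementation: it reads the questions-words.txt layout (a `: section` header line, then lines of four words `a b c expected`), and answers each analogy by picking the word whose vector is closest in cosine similarity to `b - a + c`. The names `parse_analogies`, `cosine`, and `analogy_accuracy` are made up for illustration, and `vectors` is assumed to be a plain dict of word to list of floats. In practice you would call the library's own method (as far as I know, newer gensim releases name it `evaluate_word_analogies()`).

```python
import math


def parse_analogies(lines):
    """Yield (section, a, b, c, expected) from questions-words.txt style lines."""
    section = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            # Section header, e.g. ": family"
            section = line[1:].strip()
            continue
        parts = line.split()
        if len(parts) == 4:
            yield (section, *parts)


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def analogy_accuracy(vectors, lines):
    """Fraction of 'a is to b as c is to ?' analogies answered correctly.

    Analogies containing any out-of-vocabulary word are skipped.
    """
    correct = total = 0
    for _section, a, b, c, expected in parse_analogies(lines):
        if not all(w in vectors for w in (a, b, c, expected)):
            continue
        target = [vb - va + vc
                  for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
        # The three question words are excluded from the candidate answers.
        candidates = [w for w in vectors if w not in (a, b, c)]
        guess = max(candidates, key=lambda w: cosine(vectors[w], target))
        correct += guess == expected
        total += 1
    return correct / total if total else 0.0
```

Running the same analogy file against each candidate model gives one comparable number per model, which is useful as long as analogy-solving is actually a reasonable proxy for your end-use.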

gojomo
  • I have enabled debug logs and I am getting the log below. `WARNING : train() called with an empty iterator (if not intended, be sure to provide a corpus that offers restartable iteration = an iterable).` And this one too: `worker exiting, processed 0 jobs`. Am I missing anything here? Below is my code. `model = gensim.models.Word2Vec(iter=3, min_count=1000, size=256, workers=16, window=15, max_vocab_size=15000000); sentences = Sentences(docs); model.build_vocab(sentences); sentences = Sentences(docs); model.train(sentences, total_words=15000000, epochs=5)` – Uma Maheswara Rao Pinninti Sep 07 '17 at 06:23
  • Your `Sentences` class is probably not providing a restartable, iterABLE collection – instead, it's likely a simple iteratOR, that can return every item once but then is empty. You should double-check its code and test it outside of just passing it to `Word2Vec`. (If you need to add more code details, you may want to do so by editing your question, where it can be well formatted, rather than in these limited comments.) – gojomo Sep 07 '17 at 15:19
  • See this other SO answer for more on iterable-vs-iterator: https://stackoverflow.com/questions/9884132/what-exactly-are-pythons-iterator-iterable-and-iteration-protocols – gojomo Sep 07 '17 at 15:26
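
The iterable-versus-iterator distinction from the comments above can be sketched in plain Python. The asker's `Sentences` class was never shown, so `RestartableSentences` and `one_shot_sentences` below are hypothetical stand-ins, and `docs` is assumed to be an in-memory list of raw text strings split on whitespace.

```python
class RestartableSentences:
    """Iterable: every call to __iter__ starts a fresh pass over the data."""

    def __init__(self, docs):
        # Assumption: docs is a list of raw text strings held in memory.
        self.docs = docs

    def __iter__(self):
        # A new generator is created per call, so repeated passes all see the data.
        for doc in self.docs:
            yield doc.split()


def one_shot_sentences(docs):
    """Generator: an iterator that is empty after a single pass."""
    for doc in docs:
        yield doc.split()
```

A quick test outside of `Word2Vec` is to loop over the object twice and compare counts: the class yields the same sentences both times, while the generator yields nothing on the second pass. Since the vocabulary scan and each training epoch are separate passes over the corpus, a one-shot iterator leaves nothing for training to consume.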