
I am trying to apply the word2vec model implemented in the gensim 3.6 library, on Python 3.7, on a Windows 10 machine. After preprocessing, I feed the model a list of sentences (each sentence is a list of words) as input.

I have computed the results (the 10 most similar words for a given input word, using model.wv.most_similar) first in Anaconda's Spyder and then in the Sublime Text editor.
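For reference, my code is essentially of this shape (a minimal sketch, not my exact script; the corpus here is a toy placeholder, with min_count=1 only so its words are not filtered out):

```python
from gensim.models import Word2Vec

# Placeholder corpus: a list of sentences, each a list of tokens
sentences = [
    ["the", "universe", "is", "expanding"],
    ["stars", "fill", "the", "universe"],
]

# gensim 3.6 defaults otherwise: size=100, window=5, iter=5
model = Word2Vec(sentences, min_count=1)

print(model.wv.most_similar("universe", topn=10))
```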

But I am getting different results for the same source code when it is executed from the two editors.

Which result should I choose, and why?

Below are screenshots of the results obtained by running the same code from both Spyder and Sublime Text. The input word for which I need the 10 most similar words is #universe#

I am really confused about how to choose between the results, and on what basis. Also, I have only recently started learning Word2Vec.

Any suggestion is appreciated.

Results Obtained in Spyder:

[screenshot of most_similar output]

Results Obtained using Sublime Text:

[screenshot of most_similar output]

M S
  • Possible duplicate of [Ensure the gensim generate the same Word2Vec model for different runs on the same data](https://stackoverflow.com/questions/34831551/ensure-the-gensim-generate-the-same-word2vec-model-for-different-runs-on-the-sam) – G. Anderson Dec 04 '18 at 18:27
  • I don't think so. Here, I'm talking about the same dataset, environment, and platform. The only difference is the editors. Your proposed duplicate is about different dimensions. – M S Dec 04 '18 at 20:19

1 Answer


The Word2Vec algorithm makes use of randomization internally. Further, when (as is usual, for efficiency) training is spread over multiple threads, some additional order-of-presentation randomization is introduced. Together, these mean that two runs, even in the exact same environment, can produce different results.
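If you ever need exact run-to-run determinism for debugging, it is possible, at a large speed cost. A rough sketch, assuming gensim 3.x and your existing `sentences` list:

```python
from gensim.models import Word2Vec

# Note: Python 3 randomizes string hashing per process, and gensim's
# default hash function inherits that. For full reproducibility, launch
# the interpreter with the PYTHONHASHSEED environment variable set
# (e.g. PYTHONHASHSEED=0) in BOTH editors' run configurations.

model = Word2Vec(
    sentences,
    workers=1,  # single thread: removes order-of-presentation randomness
    seed=42,    # fixes the random initialization of the vectors
)
```

But forcing determinism only hides the variance; the better goal is training that is stable despite the randomness.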

If the training is effective – sufficient data, appropriate parameters, enough training passes – all such models should be of similar quality when doing things like word-similarity, even though the actual words will be in different places. There'll be some jitter in the relative rankings of words, but the results should be broadly similar.
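You can quantify that jitter by training twice and measuring how much the top-N neighbor sets overlap. A sketch (the `neighbor_overlap` helper is mine, not a gensim API):

```python
from gensim.models import Word2Vec

def neighbor_overlap(model_a, model_b, word, topn=10):
    # Fraction of top-N nearest neighbors that the two models share
    set_a = {w for w, _ in model_a.wv.most_similar(word, topn=topn)}
    set_b = {w for w, _ in model_b.wv.most_similar(word, topn=topn)}
    return len(set_a & set_b) / topn

model_a = Word2Vec(sentences, seed=1)
model_b = Word2Vec(sentences, seed=2)
print(neighbor_overlap(model_a, model_b, "universe"))  # near 1.0 = stable
```

On a healthy setup you'd expect substantial overlap; near-zero overlap points at the data or parameter problems discussed below.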

That your results are vaguely related to 'universe' but not impressively so, and that they vary so much from one run to another, suggest there may be problems with your data, parameters, or quantity of training. (We'd expect the results to vary a little, but not that much.)

How much data do you have? (Word2Vec benefits from lots of varied word-usage examples.)
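A quick check, assuming only that your corpus is a list of token lists:

```python
n_sentences = len(sentences)
n_tokens = sum(len(s) for s in sentences)
n_unique = len({w for s in sentences for w in s})
print(n_sentences, "sentences,", n_tokens, "tokens,", n_unique, "unique words")
print("occurrences of 'universe':", sum(s.count("universe") for s in sentences))
```

Classic Word2Vec results come from corpora of millions to billions of words; a corpus of only a few thousand words rarely gives stable neighbors.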

Are you retaining rare words, by making min_count lower than its default of 5? (Such words tend not to get good vectors, and also wind up interfering with the improvement of nearby words' vectors.)

Are you trying to make very-large vectors? (Smaller datasets and smaller vocabularies can only support smaller vectors. Too-large vectors allow 'overfitting', where idiosyncrasies of the data are memorized rather than generalizable patterns learned. Or, they allow the model to continue improving in many different non-competitive directions, so model end-task/similarity results can be very different from run to run, even though each model is doing about as well as the other on its internal word-prediction task.)

Have you stuck with the default of 5 training passes (the iter parameter in gensim 3.x) even with a small dataset? (A large, varied dataset requires fewer training passes, because all words appear many times throughout the dataset anyway. If you're trying to squeeze results from thinner data, more epochs may help a little – but not as much as more varied data would.)
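Putting those knobs together, a hedged starting point for a smallish corpus (gensim 3.x parameter names; the numbers are illustrative guesses to tune, not recommendations):

```python
from gensim.models import Word2Vec

model = Word2Vec(
    sentences,
    size=50,      # smaller vectors suit smaller corpora/vocabularies
    min_count=5,  # keep (or raise) the default; don't lower it to keep rare words
    iter=20,      # more passes than the default 5, to offset thinner data
    workers=4,
)
```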

gojomo
  • Thanks for the suggestion. But I have one query. Since, as you mention, w2v uses randomization internally, would it be effective to run the model 10 times and take the average of the 10 runs, similar to the kind of cross-validation used in machine learning? – M S Dec 04 '18 at 20:45
  • Yeah. I have set the min_count as 5 – M S Dec 04 '18 at 20:52
  • Much better than averaging 10 separate runs would be finding parameters such that 2 runs are very similar. That might mean something like a 10-times-larger `epochs` training-passes value, or a much smaller vector size, or something else to match any data limitations. – gojomo Dec 04 '18 at 23:27
  • I have only scarce knowledge of the w2v model. What is meant by quantity of training? What is the difference between large-sized and small-sized vectors, and does this depend on the vocabulary's size? Is it required to set the epochs value to 5 in every run? And, broadly, what is 'thinner' data? – M S Dec 05 '18 at 13:20