
I am trying to find a pre-trained model that includes common phrases from news text, and I thought GoogleNews-vectors-negative300.bin would be a comprehensive one, but it turns out it does not even include deep_learning, machine_learning, social_network, or social_responsibility. What pre-trained model would include those phrases that often occur in news and public reports?

import gensim

# Load Google's pre-trained Word2Vec model (gensim can read the gzipped file directly).
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

# Raises KeyError: neither token is in the model's vocabulary.
model.similarity('deep_learning', 'machine_learning')
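
For reference, a minimal way to check which of those tokens the loaded model actually contains (assuming the model object from above; KeyedVectors supports the `in` operator):

# Print membership for each queried phrase token;
# looking a missing one up with model[...] raises a KeyError instead.
for token in ['deep_learning', 'machine_learning', 'social_network', 'social_responsibility']:
    print(token, token in model)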
– John

2 Answers


These are MWEs (Multi-Word Expressions), which are unlikely to be included. You could approximate them by averaging the vectors of the words that make up the MWE.

The considerations for the different operations on the component vectors, and the results they give, are discussed in: word2vec - what is best? add, concatenate or average word vectors? A minimal sketch of the averaging approach is below.
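
For illustration, a minimal sketch of that averaging idea, assuming the model object loaded in the question and gensim 4.x (the helper name mwe_vector is made up for this example):

import numpy as np

def mwe_vector(model, phrase):
    # Average the vectors of the in-vocabulary words of an underscore-joined phrase.
    words = [w for w in phrase.split('_') if w in model]
    if not words:
        raise KeyError('no component of %r is in the vocabulary' % phrase)
    return np.mean([model[w] for w in words], axis=0)

dl = mwe_vector(model, 'deep_learning')
ml = mwe_vector(model, 'machine_learning')

# Cosine similarity between the two averaged vectors.
print(np.dot(dl, ml) / (np.linalg.norm(dl) * np.linalg.norm(ml)))

Averaging loses word order and any non-compositional meaning, but it is a cheap, common baseline for MWEs the vocabulary lacks.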

– sophros

The GoogleNews vectors were trained by Google, circa 2012-2013, on a large internal corpus of news articles.

Further, the promotion of individual words into multiword phrases seems to have been done using purely statistical co-occurrence analysis (similar to what gensim's Phrases class implements), so it often won't match human perception of entities/concepts: it misses some word combinations and over-combines others.

So, concepts that were obscure (or not even coined yet!) at the time, or seldom covered in news articles, will be missing or underrepresented.

Training your own vectors on text from your own domain of interest is often best, both for coverage and to ensure the vectors reflect the word/phrase senses dominant in your texts – not those of general news or reference materials. A sketch of that phrase-detection-plus-training route follows.
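
As an illustration only, a minimal sketch assuming gensim 4.x; the tiny corpus and low thresholds are toy values, and real use needs a much larger domain corpus:

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# A few pre-tokenized sentences standing in for your own corpus.
sentences = [
    ['advances', 'in', 'deep', 'learning', 'and', 'machine', 'learning'],
    ['deep', 'learning', 'drives', 'progress', 'in', 'machine', 'learning'],
]

# Statistically promote frequent word pairs into single tokens like 'deep_learning'.
phrases = Phrases(sentences, min_count=1, threshold=1)  # toy thresholds for this tiny corpus
bigram = Phraser(phrases)
phrased = [bigram[s] for s in sentences]

# Train vectors on the re-tokenized corpus.
w2v = Word2Vec(phrased, vector_size=100, window=5, min_count=1)
print(w2v.wv.similarity('deep_learning', 'machine_learning'))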

– gojomo