
I trained a gensim Word2Vec model. Say I have a certain vector and I want to find the word it represents. What is the best way to do so?

Meaning, for a specific vector:

vec = array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

I want to get a word:

word = model.vec2word(vec)  # hypothetical method; should give 'computer'
oren_isp
    Possible duplicate of [How to find the closest word to a vector using word2vec](https://stackoverflow.com/questions/32759712/how-to-find-the-closest-word-to-a-vector-using-word2vec) – Veltzer Doron Aug 15 '18 at 08:52
    You don't get the word it represents but the most similar words. These can be the word in the corpus if you have given it the exact vector representation for it (distance=0). But the whole idea of word2vec is that you get representations of words in the corpus with a semantic/syntactic distance measure as represented by the distance between their related word vectors. – Veltzer Doron Aug 15 '18 at 08:54

2 Answers


Word-vectors are generated through an iterative, approximate process, so they shouldn't be thought of as precisely right (even though they do have exact coordinates), just "useful within certain tolerances".

So there's no lookup of an exact word for exact coordinates. Instead, gensim's Word2Vec and related classes provide most_similar(), which returns the known words closest to the given known words or vector coordinates, in ranked order, along with their cosine similarities. If you've just trained (or loaded) a full Word2Vec model into the variable model, you can get the closest words to your vector with:

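from numpy import array, float32

vec = array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)
# Returns a list of (word, cosine_similarity) tuples, best match first
similars = model.wv.most_similar(positive=[vec])
print(similars)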

If you just want the single closest word, it'd be in similars[0][0] (the first position of the top-ranked tuple).
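
To make the "closest by cosine similarity" idea concrete, here is a minimal pure-Python sketch of that kind of ranking. It is not gensim's implementation: the helper names (cosine, nearest_words) and the three-word toy vocabulary are made up for illustration, and real models hold far more words with many more dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_words(vec, word_vectors, topn=3):
    """Rank known words by cosine similarity to `vec`, best match first."""
    scored = [(word, cosine(vec, wv)) for word, wv in word_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up 3-dimensional vectors, purely for illustration
toy_vectors = {
    "computer": [1.0, 0.0, 0.0],
    "laptop": [0.9, 0.1, 0.0],
    "banana": [0.0, 0.0, 1.0],
}

# A query that is near, but not exactly equal to, the "computer" vector
ranked = nearest_words([0.98, 0.01, 0.02], toy_vectors)
print(ranked)        # list of (word, similarity) tuples
print(ranked[0][0])  # the top-ranked word here is 'computer'
```

Note that the query does not have to match any stored vector exactly; the ranking just surfaces whichever known word is nearest.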

gojomo

In spaCy, this is now supported via nlp.vocab.vectors.most_similar:

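import spacy

# Requires a model that ships with word vectors, e.g. en_core_web_md
nlp = spacy.load('en_core_web_md')
word_vec = nlp(u"Test").vector
# most_similar expects a 2D array of queries, one row per vector
result = nlp.vocab.vectors.most_similar(word_vec.reshape((1, -1)))
# result[0] holds the hash keys of the matches; map the best one back to its string
print(nlp.vocab.strings[result[0][0,0]], result)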