def n_similarity(self, ws1, ws2):
    # look up the vector for every word in each set
    v1 = [self[word] for word in ws1]
    v2 = [self[word] for word in ws2]
    # average each set, normalise each mean to unit length, and take the dot
    # product, which is the cosine similarity of the two mean vectors
    return dot(matutils.unitvec(array(v1).mean(axis=0)), matutils.unitvec(array(v2).mean(axis=0)))

This is code I excerpted from gensim's word2vec. I know that the similarity between two single words can be computed as a cosine distance, but what about two sets of words? The code seems to take the mean of the word vectors in each set and then compute the cosine distance between the two mean vectors. I know little about word2vec; is there any theoretical foundation for this approach?

AndyLiu

1 Answer


Taking the mean of all word vectors is the simplest way of reducing them to a single vector so cosine similarity can be used. The intuition is that by adding up all word vectors you get a bit of all of them (the meaning) in the result. You then divide by the number of vectors so that larger bags of words don't end up with longer vectors (not that it matters for cosine similarity anyway).
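
For illustration, here is a minimal sketch of that computation with plain numpy. The `model` argument and the function name `bag_similarity` are my own placeholders; `model` is assumed to map a word to its embedding, e.g. a trained gensim word2vec model.

    import numpy as np

    def bag_similarity(model, ws1, ws2):
        # average the word vectors in each bag
        v1 = np.mean([model[w] for w in ws1], axis=0)
        v2 = np.mean([model[w] for w in ws2], axis=0)
        # normalise to unit length so the dot product equals cosine similarity
        v1 /= np.linalg.norm(v1)
        v2 /= np.linalg.norm(v2)
        return np.dot(v1, v2)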

Reducing an entire sentence to a single vector is a complex problem, and there are other ways to do it. I wrote a bit about it in a related question on SO. Since then a bunch of new algorithms have been proposed. One of the more accessible ones is Paragraph Vector, which you shouldn't have trouble understanding if you are familiar with word2vec.
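
As a rough sketch of what using Paragraph Vector looks like in practice, gensim ships an implementation as Doc2Vec. The toy corpus below is made up, and the parameter and attribute names (`vector_size`, `model.dv`) are those of recent gensim versions (older versions use `size` and `model.docvecs`):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # toy corpus; in practice you would train on a much larger collection
    corpus = [
        TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=[0]),
        TaggedDocument(words=["dogs", "chase", "cats"], tags=[1]),
    ]

    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

    # infer a vector for an unseen sentence and compare it to the training docs
    vec = model.infer_vector(["a", "cat", "on", "a", "mat"])
    print(model.dv.most_similar([vec]))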

mbatchkarov
    But even the paragraph vectors use the mean of words during the training process as a representation of the whole sentence/paragraph. – The Wanderer Oct 27 '15 at 11:51