
I downloaded the Stack Overflow dump (a 10 GB XML file) and ran word2vec on it to get vector representations for programming terms (I need them for a project I'm working on). Here is the code:

from gensim.models import Word2Vec
from xml.dom.minidom import parse, parseString

titles, bodies = [], []
xmldoc = parse('test.xml')  # this is the dump
reflist = xmldoc.getElementsByTagName('row')
for i in range(len(reflist)):
    bitref = reflist[i]
    if 'Title' in bitref.attributes.keys():
        title = bitref.attributes['Title'].value
        titles.append(title.split())
    if 'Body' in bitref.attributes.keys():
        body = bitref.attributes['Body'].value
        bodies.append(body.split())

dimension = 8
sentences = titles + bodies
model = Word2Vec(sentences, size=dimension, iter=100)
model.save('snippet_1.model')

Now, in order to calculate the cosine similarity between a pair of sentences, I do the following:

from gensim.models import Word2Vec
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

model = Word2Vec.load('snippet_1.model')
dimension = 8
snippet = 'some text'
snippet_vector = np.zeros((1, dimension))
for word in snippet:
    if word in model.wv.vocab:
        vecvalue = model[word].reshape(1, dimension)
        snippet_vector = np.add(snippet_vector, vecvalue)

link_text = 'some other text'
link_vector = np.zeros((1, dimension))
for word in link_text:
    if word in model.wv.vocab:
        vecvalue = model[word].reshape(1, dimension)
        link_vector = np.add(link_vector, vecvalue)

print(cosine_similarity(snippet_vector, link_vector))

I am summing the word embeddings of each word in a sentence to get a representation for the sentence as a whole. I do this for both sentences and then calculate the cosine similarity between them.
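
For reference, the quantity that sklearn's cosine_similarity returns is the dot product of the two vectors divided by the product of their norms; a minimal sketch with made-up 3-dimensional vectors (purely illustrative, not taken from the model) is:

import numpy as np

a = np.array([[1.0, 2.0, 3.0]])   # hypothetical summed sentence vector
b = np.array([[2.0, 4.0, 6.5]])   # hypothetical summed sentence vector

# cosine similarity = dot product / (product of the vector norms)
print(np.dot(a, b.T) / (np.linalg.norm(a) * np.linalg.norm(b)))  # ~0.999: the vectors point in almost the same direction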

The problem is that I'm getting a cosine similarity of around 0.99 for any pair of sentences I give it. Am I doing something wrong? Any suggestions for a better approach?

morghulis
  • You might have better luck using the n_similarity of the Word2Vec model object as discussed in this question: http://stackoverflow.com/questions/26010645/why-the-similarity-beteween-two-bag-of-words-in-gensim-word2vec-calculated-this – David Mar 21 '17 at 17:43
  • Maybe too late, but I got the same results from word2vec for almost any pair of words when compared using similarity. Then I noticed that the corpus used to train word2vec was too small for the model to learn the underlying vector weights. When I increased the size of the corpus, the model started to get better results. I guess there is a minimum number of documents word2vec needs before it starts performing well. – Savrige Jan 21 '20 at 14:35
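
As a concrete follow-up to the n_similarity suggestion in the first comment, a minimal sketch (assuming the trained snippet_1.model from the question, and that every word is in the model's vocabulary) might look like this:

from gensim.models import Word2Vec

model = Word2Vec.load('snippet_1.model')

words_a = 'some text'.split()
words_b = 'some other text'.split()

# n_similarity averages the word vectors of each list and returns the cosine
# similarity of the two means; it raises a KeyError for out-of-vocabulary words
print(model.wv.n_similarity(words_a, words_b))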

1 Answer


Are you checking that your snippet_vector and link_vector are meaningful vectors before calculating their cosine-similarity?

I suspect they're just zero-vectors, or similarly non-diverse, since your for word in snippet: and for word in link_text: loops aren't tokenizing the text. So they'll just loop over the characters in each string, which either won't be present in your model as words, or the few available may match exactly between your texts. (Even with tokenization, the texts' summed vectors would only differ by the value of a vector for the one different word, 'other'.)
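
For illustration, a minimal sketch of the same summing approach with the text actually split into word tokens (assuming the question's snippet_1.model and dimension = 8) could look like this:

import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

model = Word2Vec.load('snippet_1.model')
dimension = 8

def sum_vector(text):
    # split on whitespace so the loop sees words, not characters
    vector = np.zeros((1, dimension))
    for word in text.split():
        if word in model.wv.vocab:
            vector += model.wv[word].reshape(1, dimension)
    return vector

print(cosine_similarity(sum_vector('some text'), sum_vector('some other text')))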

gojomo
  • Yes, I checked the vectors. There were a few common keywords. Anyway, the results improved by a bit after I increased the 'iter' parameter. I think I can get far better results by increasing it further. But it would take many days to run the code. – morghulis Mar 31 '17 at 03:50
  • OK, but if your code is as shown in the question, your `for word in snippet:` loop will have `word` loop through the values `'s'`, `'o'`, `'m'`, `'e'`, `' '`, `'t'`, `'e'`, `'x'`, `'t'` – *not* `'some'` and `'text'`. Separately, 100 iterations is insanely high, especially for a large dataset. (Larger datasets can usually get by with *fewer* iterations, because words repeat in many contexts.) With a large dataset you can also decrease `negative` & `sample` (& maybe even `window`) to speed things up with little loss. Meanwhile, a vector size of 8 is so small I'd doubt the vectors are useful for much. – gojomo Mar 31 '17 at 06:45
  • If your machine has 4 or more cores, you may also get a speedup by increasing the `workers` parameter. (Up to the number of cores, up to 8, is worth trying, and watching the log output to see what value maximizes words-per-second throughput.) – gojomo Mar 31 '17 at 06:50
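
To make the parameter suggestions in these comments concrete, a hedged sketch of the training call (the values below are generic starting points rather than tuned recommendations, and assume the sentences list built in the question) might be:

from gensim.models import Word2Vec

model = Word2Vec(
    sentences,      # tokenized titles + bodies from the question
    size=100,       # a more typical vector size than 8
    iter=5,         # gensim's default epoch count; a large corpus rarely needs 100
    workers=4,      # parallel training threads, up to the number of cores
    negative=5,     # negative-sampling count; can be reduced on very large corpora
    sample=1e-4,    # aggressive downsampling of very frequent words
    window=5,       # context window size
)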