I downloaded the Stack Overflow data dump (a 10 GB file) and ran word2vec on it to get vector representations for programming terms (I need them for a project I'm working on). Here is the code:
from gensim.models import Word2Vec
from xml.dom.minidom import parse

titles, bodies = [], []
xmldoc = parse('test.xml')  # this is the dump
reflist = xmldoc.getElementsByTagName('row')
for bitref in reflist:
    if 'Title' in bitref.attributes.keys():
        title = bitref.attributes['Title'].value
        titles.append(title.split())
    if 'Body' in bitref.attributes.keys():
        body = bitref.attributes['Body'].value
        bodies.append(body.split())

dimension = 8
sentences = titles + bodies
model = Word2Vec(sentences, size=dimension, iter=100)
model.save('snippet_1.model')
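(As an aside, minidom builds the whole document in memory, which is painful for a 10 GB dump. A memory-friendlier sketch using the standard library's incremental parser, assuming the dump keeps the usual `Posts.xml` layout with `Title`/`Body` attributes on each `row` element:)

```python
# Stream <row> elements one at a time instead of loading the whole file.
# Assumes the same Posts.xml layout as above (Title/Body attributes).
import xml.etree.ElementTree as ET

def iter_sentences(path):
    """Yield tokenized titles and bodies from the dump, row by row."""
    for event, elem in ET.iterparse(path, events=('end',)):
        if elem.tag == 'row':
            title = elem.get('Title')
            body = elem.get('Body')
            if title:
                yield title.split()
            if body:
                yield body.split()
            elem.clear()  # release the processed element's memory
```

The resulting generator can be passed to `Word2Vec` via `list(iter_sentences('test.xml'))`, or wrapped in a restartable iterable since gensim needs multiple passes over the corpus.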
Now, in order to calculate the cosine similarity between a pair of sentences, I do the following:
from gensim.models import Word2Vec
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

model = Word2Vec.load('snippet_1.model')
dimension = 8

snippet = 'some text'
snippet_vector = np.zeros((1, dimension))
for word in snippet:
    if word in model.wv.vocab:
        vecvalue = model[word].reshape(1, dimension)
        snippet_vector = np.add(snippet_vector, vecvalue)

link_text = 'some other text'
link_vector = np.zeros((1, dimension))
for word in link_text:
    if word in model.wv.vocab:
        vecvalue = model[word].reshape(1, dimension)
        link_vector = np.add(link_vector, vecvalue)

print(cosine_similarity(snippet_vector, link_vector))
I sum the word embeddings of the words in a sentence to get a representation for the sentence as a whole. I do this for both sentences and then compute the cosine similarity between them.
The problem is that I get a cosine similarity of around 0.99 for any pair of sentences I try. Am I doing something wrong? Any suggestions for a better approach?