I've got a problem/question with Word2Vec
As I understand it: let's train a model on a text corpus (in my case it's ~2 GB). Now take one line from this text and compute that line's vector (line's vector = sum of its words' vectors). It will be something like this:
import numpy as np

coords = np.zeros(model.vector_size)
for w in words:
    coords += model[w]
Then let's calculate the length of this vector, with the standard library:
vectorLen = np.linalg.norm(coords)
Why do we need Word2Vec? For converting words to vectors AND for contextual proximity (words that occur near each other, and words that are close in meaning, get similar coordinates)!
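To make "similar coordinates" concrete: as I understand it, proximity between word vectors is usually measured with cosine similarity. A minimal NumPy sketch with made-up 3-dimensional vectors (the names and numbers are purely illustrative, not from a real model):

```python
import numpy as np

def cosine(a, b):
    # cosine similarity: near 1.0 for vectors pointing the same way,
    # near 0 (or negative) for unrelated directions
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_cat = np.array([0.9, 0.1, 0.2])
v_dog = np.array([0.85, 0.15, 0.25])  # hypothetical "nearby" embedding
v_car = np.array([-0.1, 0.9, -0.3])   # hypothetical unrelated embedding

sim_related = cosine(v_cat, v_dog)
sim_unrelated = cosine(v_cat, v_car)
print(sim_related, sim_unrelated)
```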
And what I expect: if I take some line of the text, add a word from the dictionary that is not typical for this line, and then calculate the length of the vector again, I should get a value quite different from the length computed for the line alone, without the uncharacteristic word.
But in fact, the lengths of these vectors (before and after adding the word(s)) are quite similar! Moreover, they are practically the same! Why do I get this result? If I understand correctly, the coordinates of the line's own words will be quite similar (contextual proximity), but the new word will have rather different coordinates, and that should affect the result (the vector length of the line with the new word)!
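I can even reproduce this effect without Word2Vec at all. A minimal NumPy sketch, where random unit vectors stand in for word embeddings (an assumption, since real word2vec vectors are not unit-length, but the geometry is similar): one extra vector barely changes the norm of a sum of many.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words = 300, 30

# random unit vectors stand in for the embeddings of one line's words
vecs = rng.normal(size=(n_words, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
line_vec = vecs.sum(axis=0)

# one more random unit vector stands in for the "uncharacteristic" word
extra = rng.normal(size=dim)
extra /= np.linalg.norm(extra)

before = np.linalg.norm(line_vec)
after = np.linalg.norm(line_vec + extra)
rel_change = abs(after - before) / before
print(before, after, rel_change)
```

By the triangle inequality the norm can shift by at most the added vector's length, so the relative change stays small once the line already contains many words.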
E.g., these are my W2V model settings:
#Word2Vec model
model = gensim.models.Word2Vec(
    sg=0,
    size=300,
    window=3,
    min_count=1,
    hs=0,
    negative=5,
    workers=10,
    alpha=0.025,
    min_alpha=0.025,
    sample=1e-3,
    iter=20
)
#prepare the model vocabulary
model.build_vocab(sentences, update=False)
#train model
model.train(sentences, epochs=model.iter, total_examples=model.corpus_count)
OR this:
#Word2Vec model
model = gensim.models.Word2Vec(
    sg=1,
    size=100,
    window=10,
    min_count=1,
    hs=0,
    negative=5,
    workers=10,
    alpha=0.025,
    min_alpha=0.025,
    seed=7,
    sample=1e-3,
    hashfxn=hash,
    iter=20
)
#prepare the model vocabulary
model.build_vocab(sentences, update=False)
What's the problem? And how can I get the result I need?