
I've got a problem/question with Word2Vec

As I understand it: let's train a model on a text corpus (in my case a corpus of ~2 GB). Then let's take one line from this text and calculate a vector for that line (line vector = sum of its word vectors). It will be something like this:

import numpy as np
coords = np.zeros(model.vector_size)  # running sum of the line's word-vectors
for w in words:
    coords += model[w]

Then let's calculate the length of this vector, using NumPy:

import numpy as np
vectorLen = np.linalg.norm(coords)

Why do we need Word2Vec? For converting words to vectors AND for contextual proximity (words that occur near each other, and words that are close in meaning, get similar coordinates)!
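For example, after training, the model can be asked directly how close two words are. (The word pair below is just a hypothetical example from my vocabulary.)

print(model.wv.similarity('user', 'client'))  # related words: expect a higher value
print(model.wv.similarity('user', 'error'))   # less related words: expect a lower value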

And what I expect: if I take some line of the text and add to it a word from the dictionary that is not typical for this line, then calculate the vector length again, I should get a value quite different from the length of the line's vector without the uncharacteristic word added.

But in fact, the lengths of these vectors (before and after adding the word(s)) are quite similar, practically the same! Why do I get this result? If I understand correctly, the words of the line should have quite similar coordinates (contextual proximity), but the new word should have rather different coordinates, and that should affect the result (the vector length of the line with the new word)!

E.g. these are my W2V model settings:

#Word2Vec model
import gensim

model = gensim.models.Word2Vec(
    sg=0,
    size=300,
    window=3,
    min_count=1,
    hs=0,
    negative=5,
    workers=10,
    alpha=0.025,
    min_alpha=0.025,
    sample=1e-3,
    iter=20
)

#prepare the model vocabulary
model.build_vocab(sentences, update=False)

#train model
model.train(sentences, epochs=model.iter, total_examples=model.corpus_count)

OR this:

#Word2Vec model

model = gensim.models.Word2Vec(
    sg=1,
    size=100,
    window=10,
    min_count=1,
    hs=0,
    negative=5,
    workers=10,
    alpha=0.025,
    min_alpha=0.025,
    seed=7,
    sample=1e-3,
    hashfxn=hash,
    iter=20
)

#prepare the model vocabulary
model.build_vocab(sentences, update=False)

What's the problem? And how can I get the result I need?

user100123122

1 Answer


Why do you need the "vector length" to noticeably change, as a "desired result"?

The length of word-vectors (or sums of same) isn't usually of major interest. In fact, it's common to normalize the word-vectors to unit-length before doing comparisons. (And sometimes, when doing sums/averages as a simple way to create vectors for runs-of-multiple-words, the vectors might be unit-normalized before or after such an operation.)

Instead, it's usually the direction (angle) that's of most interest.
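For example, a common pattern (just a sketch, assuming a trained gensim model named model and two token lists words_a and words_b) is to sum the word-vectors, unit-normalize each sum, and then compare the two texts by cosine similarity, i.e. by direction rather than by length:

import numpy as np

def text_vector(model, words):
    # Sum the word-vectors of the in-vocabulary tokens, then unit-normalize,
    # so only the direction (not the length) matters in comparisons.
    vec = np.sum([model.wv[w] for w in words if w in model.wv], axis=0)
    return vec / np.linalg.norm(vec)

# Cosine similarity of the two unit-length text vectors:
similarity = np.dot(text_vector(model, words_a), text_vector(model, words_b))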

Further, what do you mean when describing the length values as "quite similar"? Without showing the actual lengths you've seen in your tests, it's unclear whether your intuitions about what the change "should" be are correct.

Note that in multi-dimensional spaces, and especially high-dimensional spaces, our intuitions are quite often wrong.

For example, try adding a bunch of pairs of random unit vectors in 2d space, and looking at the norm (length) of each sum. As you might expect, you'll likely see varied results ranging from nearly 0.0 to nearly 2.0, depending on whether the sum lands closer to or further from the origin.

Try instead adding a bunch of pairs of random unit vectors in 500d space. Now, the norm length of the sum is going to almost always be close to 1.4. Essentially, with 500 directions to go, most sums won't significantly move closer or further to the origin, even though they still move 1.0 away from either vector individually.
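You can check this with a quick simulation (a minimal NumPy sketch, independent of any word2vec model):

import numpy as np

def sum_norms(dim, pairs=10000):
    # Lengths of the sums of `pairs` random pairs of unit vectors in `dim` dimensions.
    vecs = np.random.randn(pairs, 2, dim)
    vecs /= np.linalg.norm(vecs, axis=2, keepdims=True)  # make every vector unit-length
    return np.linalg.norm(vecs.sum(axis=1), axis=1)

for d in (2, 500):
    norms = sum_norms(d)
    print(d, norms.min(), norms.mean(), norms.max())
# In 2d the lengths spread from near 0.0 to near 2.0;
# in 500d they cluster tightly around sqrt(2) ≈ 1.414.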

You're likely observing the same thing with your word-vectors. They're fine, but the measure you've chosen to take, the norm of a vector sum, just doesn't change the way you'd expect in a high-dimensional space.

Separately, unrelated to your main issue, but about your displayed word2vec parameters:

  • You might think using a non-default min_count=1, by retaining more words/information, results in better vectors. However, it usually hurts word-vector quality to retain such rare words. Word-vector quality requires many varied examples of word usage. Words with just 1, or a few, examples don't get good vectors from those few idiosyncratic usage examples, but do serve as training noise/interference in the improvement of other word-vectors with more examples.
  • Usual stochastic-gradient-descent optimization relies on the alpha learning-rate decaying to a negligible value over the course of training. Setting the ending min_alpha to the same value as the starting alpha thwarts this. (In general, most users shouldn't change either of the alpha parameters, and if they need to tinker at all, changing the starting value makes more sense.) A sketch of more conventional settings follows below.
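Putting those two points together, here is a sketch of more conventional settings. It's only an illustration, keeping your other choices and assuming the same gensim version and parameter names (size, iter) as your code:

import gensim

# Sketch only: raise min_count so very rare words are dropped, and leave the
# alpha / min_alpha learning-rate parameters at their defaults so the rate
# can decay normally over training.
model = gensim.models.Word2Vec(
    sg=0,
    size=300,
    window=3,
    min_count=5,
    hs=0,
    negative=5,
    workers=10,
    sample=1e-3,
    iter=20,
)
model.build_vocab(sentences, update=False)
model.train(sentences, epochs=model.iter, total_examples=model.corpus_count)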
gojomo
  • About the question "Why do you need the 'vector length' to noticeably change, as a 'desired result'?" Let me put it another way: I am exploring system log files to search for anomalous situations. For that part I use Word2Vec. I get a dictionary of all words with their proximities, and I want to detect the situation where some word appears in a log line, but in the system's "normal mode" we wouldn't find it there! –  Feb 08 '19 at 13:54
  • So I tried to learn the "normal situations" (by calculating these vectors), expecting that if a word that is not typical for the current line appears, it will influence the line's overall value (because this word has coordinates distant from the other words in the line). –  Feb 08 '19 at 14:00
  • Maybe there is another method to solve this task, i.e. to detect these words in log lines? I understand and agree with you: in a vector space with 300 coordinates it is hard to detect an anomalous word in a log line this way, because its contribution is not that great. –  Feb 08 '19 at 14:02
  • I'm not sure I understand your question, but there **is** an existing `gensim` method on word-vectors, `doesnt_match()`, which takes a list of words, and reports the one word which is furthest from the average of all the words. It'll usually report the sort of word a human would report as being "not like the others". You could try it (a small usage sketch follows after these comments), but I kind of doubt it would work on log lines of an automated process. (Don't such log lines include lots of non-natural-language data? And lots of arbitrarily-finely-different numeric values? Word2vec works best on language-like token distributions.) – gojomo Feb 08 '19 at 18:23
  • I do a lot of preprocessing before using W2V: deleting all non-alphabetic characters, stemming, and removing stop words. After this preprocessing I have good data for W2V. –  Feb 08 '19 at 18:51
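A minimal usage sketch of that doesnt_match() suggestion (the token list below is hypothetical; with real data you'd pass the preprocessed tokens of a single log line):

# Sketch: ask the model which token is least like the others in one log line.
tokens = ['connect', 'database', 'query', 'banana']  # hypothetical preprocessed line
print(model.wv.doesnt_match(tokens))  # ideally the out-of-place token, e.g. 'banana'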