ML models with a huge number of parameters tend to overfit (since they have high variance). In my opinion, word2vec is one such model. One way to reduce model variance is to apply a regularization technique, which is very common for other embedding models, such as matrix factorization. However, the basic version of word2vec doesn't have any regularization term. Is there a reason for this?
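To make the question concrete, here is a minimal NumPy sketch of the skip-gram negative-sampling loss for a single training pair, with the kind of L2 penalty I have in mind added as an optional term. All names here (`W_in`, `W_out`, `lam`) are my own and not from the original implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(W_in, W_out, center, context, negatives, lam=0.0):
    """Skip-gram negative-sampling loss for one (center, context) pair.

    W_in, W_out : (vocab_size, dim) input/output embedding matrices
    negatives   : list of sampled negative word indices
    lam         : L2 strength; the basic word2vec objective corresponds to lam=0
    """
    v = W_in[center]                                   # center-word embedding
    pos = -np.log(sigmoid(W_out[context] @ v))         # positive (observed) pair
    neg = -np.sum(np.log(sigmoid(-W_out[negatives] @ v)))  # sampled negatives
    # The hypothetical regularizer the question asks about:
    l2 = lam * (np.sum(v ** 2) + np.sum(W_out[[context] + list(negatives)] ** 2))
    return pos + neg + l2
```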


Tural Gurbanov
1 Answer
That's an interesting question.
I'd say that overfitting in word2vec doesn't make a lot of sense, because the goal of word embeddings is to match the word occurrence distribution as closely as possible. Word2vec is not designed to learn anything outside of the training vocabulary, i.e., to generalize, but to approximate the one distribution defined by the text corpus. In this sense, word2vec is actually trying to fit exactly, so it can't over-fit.
If you had a small vocabulary, it'd be possible to compute the co-occurrence matrix and find the exact global minimum for the embeddings (of a given size), i.e., get the perfect fit, and that would define the best contextual word model for this fixed language.
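As a toy illustration of this point (the corpus, window size, and embedding size below are made up, and I use a squared-error factorization, for which the truncated SVD gives the exact global optimum, as a stand-in for the word2vec objective):

```python
import numpy as np

# Tiny made-up corpus and vocabulary
corpus = ["the cat sat on the mat".split(), "the dog sat on the rug".split()]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Count co-occurrences within a symmetric window
C = np.zeros((len(vocab), len(vocab)))
window = 2
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                C[idx[w], idx[sent[j]]] += 1

# Truncated SVD: the exact best rank-k fit of C under squared error
k = 3
U, S, Vt = np.linalg.svd(C)
embeddings = U[:, :k] * np.sqrt(S[:k])   # one k-dimensional vector per word
```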

Maxim
That's true, but during training we use a sample of the negative labels, which are selected using an additional network layer. The selection is done by feeding the word's embedding as input to the sampling layer. Hence, if I am not mistaken, the value of an embedding feature can influence the selection of the sample and, as a result, influence the model's output... – Tural Gurbanov Jan 15 '18 at 16:15