Update gensim word2vec model

Question

I have a word2vec model in gensim trained over 98892 documents. For any given sentence that is not present in the sentences array (i.e. the set over which I trained the model), I need to update the model with that sentence so that querying it next time returns results. I am doing it like this:

new_sentence = ['moscow', 'weather', 'cold']
model.train(new_sentence)

and it prints these log lines:

2014-03-01 16:46:58,061 : INFO : training model with 1 workers on 98892 vocabulary and 100 features
2014-03-01 16:46:58,211 : INFO : reached the end of input; waiting to finish 1 outstanding jobs
2014-03-01 16:46:58,235 : INFO : training on 10 words took 0.1s, 174 words/s

Now, when I query with the same new_sentence (as model.most_similar(positive=new_sentence)), it raises an error:

Traceback (most recent call last):
  File "<pyshell#220>", line 1, in <module>
    model.most_similar(positive=['moscow', 'weather', 'cold'])
  File "/Library/Python/2.7/site-packages/gensim/models/word2vec.py", line 405, in most_similar
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'cold' not in vocabulary"

This suggests that the word 'cold' is not part of the vocabulary the model was trained on (am I right)?

So the question is: how do I update the model so that it returns similarities for the given new sentence?

Someone has extended gensim's `Word2Vec` into an `online Word2Vec`, where you can update the vocabulary and learn new words through online learning. I have not tried it, but check it out at: http://rutumulkar.com/blog/2015/word2vec/ – Aziz Alto, Nov 28 '15 at 22:33

score 25 · Answer 1 · answered May 31 '14 at 10:23

train() expects a sequence of sentences on input, not one sentence.
train() only updates weights for existing feature vectors based on existing vocabulary. You cannot add new vocabulary (=new feature vectors) using train().
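
To illustrate the first point: the input is iterated as a sequence of sentences, and each sentence as a sequence of tokens. Below is a minimal pure-Python sketch of what that means for the question's input. It makes no gensim call; `tokens_seen` is a hypothetical helper that only mimics the nested iteration.

```python
def tokens_seen(sentences):
    """Mimic nested iteration: outer loop = sentences, inner loop = tokens."""
    return [token for sentence in sentences for token in sentence]


new_sentence = ['moscow', 'weather', 'cold']

# Passing the sentence directly: each string is treated as a "sentence",
# so its individual characters become the "words".
wrong = tokens_seen(new_sentence)

# Wrapping the sentence in a list yields one sentence of three tokens.
right = tokens_seen([new_sentence])

print(wrong[:6])
print(right)
```

This is consistent with the log in the question reporting more than three trained words. Wrapping the sentence in a list fixes the input shape, but per the second point it still does not add 'cold' to the vocabulary.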

Radim

So how do I add new vocabulary? Is it definitely not possible? Thank you – Nacho May 23 '16 at 22:32

@Nacho, ["The word2vec algorithm doesn’t support adding new words dynamically."](http://rare-technologies.com/word2vec-tutorial/#comment-2281) So, no, it isn't possible unless you retrain the entire model with the new vocab. – Jason Aug 10 '16 at 20:41

WARNING: outdated answer so I'm downvoting this answer to allow the answer here https://stackoverflow.com/a/40936428/1019952 to go up. Or please update your answer. – Mohamed Taher Alrefaie Jun 06 '20 at 22:19

score 25 · Answer 2 · answered Dec 02 '16 at 16:07

As of gensim 0.13.3 it's possible to do online training of Word2Vec with gensim.

model.build_vocab(new_sentences, update=True)
model.train(new_sentences)

Kamil Sindi

This does not actually work though for some reason. http://stackoverflow.com/questions/42357678/gensim-word2vec-array-dimensions-in-updating-with-online-word-embedding – chase Feb 22 '17 at 21:15
I haven't had issues implementing this. I'll try to take a look at your SO post this weekend. – Kamil Sindi Feb 22 '17 at 23:31
@chase I answered your SO post – Kamil Sindi Feb 23 '17 at 15:35

score 8 · Answer 3 · edited Mar 16 '15 at 13:18

If your model was generated by the original C tool and loaded with load_word2vec_format(), it is not possible to update that model. See the online training section of the word2vec tutorial:

Note that it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.

score 2 · Answer 4 · answered Aug 19 '16 at 23:52

First of all, you cannot add new words to a pre-trained model.

However, there's a "new" doc2vec model, published in 2014, which meets all your requirements. You can use it to train a document vector directly, instead of computing a set of word vectors and then combining them. The best part is that doc2vec can infer vectors for unseen sentences after training. Although the model itself is still unchangeable, in my experiments the inferred results were pretty good.

score 2 · Answer 5 · answered Oct 13 '16 at 00:10

The problem is that you cannot retrain a word2vec model with new sentences. Only doc2vec allows it. Try a doc2vec model.

Nurul Akter Towhid

score 1 · Answer 6 · answered Sep 25 '19 at 01:49

You can add to the model vocabulary, and add to the embedding using FastText.

from gensim.models import FastText

Here you can see some FastText examples. Here you can see how to use FastText to score Out-of-vocabulary (OOV) instances.

S.P.
