
I'm having a problem updating my Word2Vec model online.

I have a document and build a model from it. But the document can be updated with new words, and I need to update the vocabulary and the model as a whole.

I know this is possible in gensim 0.13.4.1.

My code:

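import gensim

# Create an untrained model; min_count=5 drops words seen fewer than 5 times
model = gensim.models.Word2Vec(size=100, window=10, min_count=5, workers=11, alpha=0.025, min_alpha=0.025, iter=20)
# Build the initial vocabulary from the original corpus
model.build_vocab(sentences, update=False)

model.train(sentences, epochs=model.iter, total_examples=model.corpus_count)

model.save('model.bin')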

After this I get new words, for example:

sen2 = [['absd', 'jadoih', 'sdohf'], ['asdihf', 'oisdh', 'oiswhefo'], ['a', 'v', 'b', 'c'], ['q', 'q', 'q']]

model.build_vocab(sen2, update=True)
model.train(sen2, epochs=model.iter, total_examples=model.corpus_count)

What's wrong and how can I solve my problem?

ctrlaltdel
  • You're not showing any error, or explaining what you tried and got a result different than what was expected. So what exactly is the problem? – gojomo Dec 04 '18 at 10:14
  • @gojomo I get a result different from what I expected. For example, after the first training the model's vocabulary size is 597. After re-training (I expected it to add the 11 new words from sen2) the vocabulary size is still 597 – ctrlaltdel Dec 04 '18 at 10:18
  • https://stackoverflow.com/questions/22121028/update-gensim-word2vec-model I've seen this, but the parameter "update=True" didn't help me – ctrlaltdel Dec 04 '18 at 10:25

1 Answer


Your model is set to ignore words with fewer than 5 occurrences: min_count=5. It will, in fact, require at least 5 occurrences in a single build_vocab() call. (It won't remember there were 3 before, then see 2 new occurrences, then train on all 5. It needs all 5 or more in one batch.)

If you're only calling your update with the tiny dataset shown, no new words will make the cut.
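The per-batch cutoff can be illustrated without gensim at all. The following is only a rough sketch in plain Python of the counting rule described above, not gensim's actual implementation; the helper name surviving_words and the word 'rare' are made up for illustration.

```python
from collections import Counter


def surviving_words(batch, min_count):
    """Hypothetical helper: words whose count inside this one batch reaches min_count."""
    counts = Counter(word for sentence in batch for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


# The update data from the question
sen2 = [['absd', 'jadoih', 'sdohf'], ['asdihf', 'oisdh', 'oiswhefo'],
        ['a', 'v', 'b', 'c'], ['q', 'q', 'q']]

# With min_count=5 nothing qualifies: the most frequent word, 'q', occurs only 3 times
new_with_5 = surviving_words(sen2, min_count=5)

# With min_count=1 every distinct word in sen2 would qualify
new_with_1 = surviving_words(sen2, min_count=1)

# The "3 before, 2 now" case: counted per batch the word never reaches 5,
# counted over the combined data it does
old_batch = [['rare', 'rare', 'rare']]
new_batch = [['rare', 'rare']]
separately = surviving_words(old_batch, 5) | surviving_words(new_batch, 5)
together = surviving_words(old_batch + new_batch, 5)

print(new_with_5)
print(len(new_with_1))
print(separately, together)
```

Under this rule, the tiny sen2 update adds nothing while min_count stays at 5, which matches the unchanged vocabulary size reported in the question.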

More generally, if at all possible, you should retrain the whole model with all old and new data. That will ensure equal influence is given to old and new words, and any words are treated properly according to their combined frequency. Making small incremental updates to a Word2Vec model risks pulling newer words, or old words that continue to reappear, out of meaningful arrangement with older words that were only trained in the original (or earlier) batches. (Only words that go through the same interleaved training cycles are fully positionally adjusted with respect to each other.)

gojomo
  • Oh, so it really just skips them. Thank you! I'm trying to solve a big task, and w2v is one possible solution to it. Maybe you could help me if you look at this topic? Re-training the whole model is very impractical when there are a lot of documents and new words: https://stackoverflow.com/questions/53607110/creating-vector-space – ctrlaltdel Dec 04 '18 at 10:39