
I have trained word embeddings on 26 million tweets with the skip-gram technique, as follows:

import gensim

# Stream the corpus line by line; each line is one whitespace-tokenized tweet
sentences = gensim.models.word2vec.LineSentence('/.../data/tweets_26M.txt')
model = gensim.models.word2vec.Word2Vec(sentences, window=2, sg=1, size=200, iter=20)
model.save_word2vec_format('/.../savedModel/Tweets26M_All.model.bin', binary=True)

However, I am continuously collecting more tweets in my database. For example, when I have 2 million more tweets, I want to update my embeddings to also take these newly arrived 2M tweets into account.

Is it possible to load the previously trained model and update the embedding weights (perhaps also adding new word embeddings to the model)? Or do I need to retrain on all 28 (26+2) million tweets from the beginning? Training takes 5 hours with the current parameters and will take even longer as the data grows.

One other question: is it possible to feed the sentences parameter directly from the database (instead of reading it from txt, bz2 or gz files)? As our training data gets bigger, it would be better to bypass text read/write operations.
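Gensim only needs a restartable iterable of token lists, so a small wrapper class over a database cursor works in place of `LineSentence`. A minimal sketch using sqlite3; the table and column names (`tweets`, `text`) are placeholders for your schema, and any DB-API driver follows the same pattern:

```python
import sqlite3

class DBSentences(object):
    """Restartable iterable that streams tokenized tweets from a database.

    Word2Vec iterates the corpus once for the vocabulary scan and once per
    training epoch, so __iter__ must open a fresh cursor on every pass.
    """

    def __init__(self, db_path):
        self.db_path = db_path

    def __iter__(self):
        conn = sqlite3.connect(self.db_path)
        try:
            # 'tweets' / 'text' are hypothetical names; adapt to your schema.
            for (text,) in conn.execute("SELECT text FROM tweets"):
                yield text.split()
        finally:
            conn.close()

# Usage sketch:
# sentences = DBSentences('/.../data/tweets.db')
# model = gensim.models.word2vec.Word2Vec(sentences, window=2, sg=1, size=200, iter=20)
```

A plain generator would not work here, because a generator is exhausted after one pass and Word2Vec needs several.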

Inanc Arin
    Did you try loading the model `model = Word2Vec.load_word2vec_format('/.../savedModel/Tweets26M_All.model.bin', binary=True)` and just train it more with `model.train(more_sentences)` ? – Ziumin Nov 21 '16 at 19:14
  • @Ziumin It gives an error as follows: `/anaconda/envs/tensorflow/lib/python2.7/site-packages/gensim/models/word2vec.pyc in train(self, sentences, total_words, word_count, total_examples, queue_factor, report_delay) 780 781 if total_words is None and total_examples is None: --> 782 if self.corpus_count: 783 total_examples = self.corpus_count 784 logger.info("expecting %i sentences, matching count from corpus used for vocabulary survey", total_examples) AttributeError: 'Word2Vec' object has no attribute 'corpus_count'` – Inanc Arin Nov 21 '16 at 22:50
  • I think it might not be implemented this way, but you can probably tweak it. There should be a step in the training implementation where the word vectors are initialized randomly. You might change this step so that instead of these random vectors you read the vectors from your model. However, I don't really know how difficult these changes are. There is also the problem of what to do with new words in your vocabulary – there are probably some words that were not present in your previous corpora. – piko Nov 23 '16 at 12:16
  • I think you are right, @piko. Maybe I should switch back to the TensorFlow implementation of skip-gram instead of using gensim. That way, maybe I could do whatever I need. – Inanc Arin Nov 23 '16 at 13:10
  • Possible duplicate of [Update gensim word2vec model](http://stackoverflow.com/questions/22121028/update-gensim-word2vec-model) – Kamil Sindi Dec 02 '16 at 16:09

0 Answers