
I have a "corpus" built from an item-item graph, which means each sentence is a graph walk path and each word is an item. I want to train a word2vec model upon the corpus to obtain items' embedding vectors. The graph is updated everyday so the word2vec model is trained in an increased way (using Word2Vec.save() and Word2Vec.load()) to keep updating the items' vectors.

Unlike words, the items in my corpus have a lifetime, and new items are added every day. To prevent the model from growing without bound, I need to drop items that have reached the end of their lifetime while keeping the model trainable. I've read a similar question here, but its answer doesn't cover incremental training and is based on KeyedVectors. I came up with the code below, but I'm not sure whether it is correct and proper:

from gensim.models import Word2Vec
import numpy as np

texts = [["a", "b", "c"], ["a", "h", "b"]]
m = Word2Vec(texts, size=5, window=5, min_count=1, workers=1)

print(m.wv.index2word)
print(m.wv.vectors)

# drop old words
wordsToDrop = ["b", "c"]
for w in wordsToDrop:
    i = m.wv.index2word.index(w)
    m.wv.index2word.pop(i)
    m.wv.vectors = np.delete(m.wv.vectors, i, axis=0)
    del m.wv.vocab[w]

print(m.wv.index2word)
print(m.wv.vectors)
m.save("m.model")
del m

# incremental training
new = [["a", "e", "n"], ["r", "s"]]
m = Word2Vec.load("m.model")
m.build_vocab(new, update=True)
m.train(new, total_examples=m.corpus_count, epochs=2)
print(m.wv.index2word)
print(m.wv.vectors)

After deleting and incrementally training, are m.wv.index2word and m.wv.vectors still element-wise aligned? Are there any side effects of the above code? If my approach isn't good, could someone give me an example of how to drop the old "words" properly while keeping the model trainable?

YQ.Wang

1 Answer


There's no official support for removing words from a Gensim Word2Vec model, once they've ever "made the cut" for inclusion.

Even the ability to add words isn't on great footing: the feature isn't based on any proven/published method of updating a Word2Vec model, and it glosses over difficult tradeoffs in how update batches affect the model, via the choice of learning rate and whether the batches fully represent the existing vocabulary. The safest course is to regularly re-train the model from scratch, on a full corpus with sufficient examples of all relevant words.

So, my main suggestion would be to regularly replace your model with a new one trained with all still-relevant data. That would ensure it's no longer wasting model state on obsolete terms, and that all still-live terms have received coequal, interleaved training.
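
For example, a minimal sketch of that approach (assuming gensim 3.x and the same parameters as in your code; current_walks is a hypothetical name for an iterable of the current period's graph-walk sentences, covering only still-live items):

from gensim.models import Word2Vec

# current_walks: hypothetical iterable of this period's graph-walk "sentences",
# containing only still-live items
fresh = Word2Vec(current_walks, size=5, window=5, min_count=1, workers=1)
fresh.save("m.model")  # the new era's model simply replaces the old file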

After such a reset, word-vectors won't be comparable to word-vectors from a prior 'model era'. (The same word, even if its tangible meaning hasn't changed, could land in an arbitrarily different place - but the relative relationships with other vectors should remain as good or better.) But that same sort of drift-out-of-comparison is also happening with any set of small-batch updates that don't 'touch' every existing word equally, just at some unquantifiable rate.

OTOH, if you think you need to stay with such incremental updates, even knowing the caveats, it's plausible that you could patch-up the model structures to retain as much as is sensible from the old model & continue training.

Your code so far is a reasonable start, but it's missing a few considerations that matter for proper functionality:

  • because deleting earlier words changes the index positions of later words, you'd need to update the vocab[word].index value of every surviving word to match the new index2word ordering. For example, after doing all the deletions, you might do:
for i, word in enumerate(m.wv.index2word):
    m.wv.vocab[word].index = i
  • because in your (default negative-sampling) Word2Vec model there is also another array of per-word weights, related to the model's output layer, that should be updated in sync, so that the right output values are checked per word. Roughly, whenever you delete a row from m.wv.vectors, you should delete the same row from m.trainables.syn1neg (see the combined sketch after this list).

  • because the surviving vocabulary has different relative word-frequencies, both the negative-sampling and downsampling (controlled by the sample parameter) functions rely on pre-calculated structures that should be rebuilt for the new frequency distribution. For the cumulative-distribution table used by negative-sampling, this is pretty easy:

m.vocabulary.make_cum_table(m.wv)  # in gensim 3.x the cum_table lives on model.vocabulary

For the downsampling, you'd want to update the .sample_int values similar to the logic you can view around the code at https://github.com/RaRe-Technologies/gensim/blob/3.8.3/gensim/models/word2vec.py#L1534. (But, looking at that code now, I think it may be buggy in that it's updating all words with just the frequency info in the new dict, so probably fouling the usual downsampling of truly-frequent words, and possibly erroneously downsampling words that are only frequent in the new update.)
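
Putting those fixes together, a rough sketch of the whole patch-up (assuming gensim 3.8.x, where these internals live on m.wv, m.vocabulary and m.trainables, and reusing m and wordsToDrop from your code) might look like the following - treat it as a starting point to verify against your installed version, not a drop-in solution:

import numpy as np

# row indices of the expiring words, gathered before anything is removed
rows_to_drop = [m.wv.vocab[w].index for w in wordsToDrop]

# delete the same rows from the input vectors and the negative-sampling
# output-layer weights, so the two arrays stay in sync
m.wv.vectors = np.delete(m.wv.vectors, rows_to_drop, axis=0)
m.trainables.syn1neg = np.delete(m.trainables.syn1neg, rows_to_drop, axis=0)

# remove the words from the lookup structures
drop_set = set(wordsToDrop)
m.wv.index2word = [w for w in m.wv.index2word if w not in drop_set]
for w in drop_set:
    del m.wv.vocab[w]

# re-align every surviving word's .index with its new row position
for i, word in enumerate(m.wv.index2word):
    m.wv.vocab[word].index = i

# rebuild the negative-sampling cumulative-distribution table
m.vocabulary.make_cum_table(m.wv)

# recompute per-word downsampling thresholds, mirroring the prepare_vocab
# logic linked above but using only the surviving words' counts; assumes
# 0 < sample < 1 (the default is 0.001) - double-check against your version
sample = m.vocabulary.sample
retain_total = sum(v.count for v in m.wv.vocab.values())
threshold_count = sample * retain_total
for v in m.wv.vocab.values():
    word_probability = (np.sqrt(v.count / threshold_count) + 1) * (threshold_count / v.count)
    v.sample_int = int(round(min(word_probability, 1.0) * 2**32))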

If those internal structures are updated properly in sync with your existing actions, the model is probably in a consistent state for further training. (But note: these structures change a lot in the forthcoming gensim-4.0.0 release, so any custom tampering like this will need to be revisited when you upgrade.)

One other efficiency note: the np.delete() operation will create a new array, the full size of the surviving array, and copy the old values over, each time it is called. So using it to remove many rows, one at a time, from a very-large original array is likely to require a lot of redundant allocation/copying/garbage-collection. You may be able to call it once, at the end, with a list of all indexes to remove.
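
For example, a minimal sketch (reusing m and wordsToDrop from your code, and gathering the row indices before the vocab entries are removed, as the combined sketch above already does):

# one np.delete() call with every row index, so the array is copied only once
rows_to_drop = [m.wv.vocab[w].index for w in wordsToDrop]
m.wv.vectors = np.delete(m.wv.vectors, rows_to_drop, axis=0)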

But really: the simpler and better-grounded approach, which may also yield significantly better continually-comparable vectors, would be to retrain with all current data whenever possible, or at least whenever a large amount of change has accumulated.

gojomo
  • Great answer. However, we can't be sure that the retrained model will produce the "continually-comparable vectors" mentioned in the last paragraph, right? The main downside of retraining is that, once the vectors are reset, the downstream models (using embedding results for downstream tasks is quite common) will also need to be retrained. – YQ.Wang Sep 17 '20 at 05:02
  • Yes - doing a fresh-from-scratch word-vector model means there may be no correlations with a prior era's coordinate-space. Thus cached downstream calculations/models based on that old space should be refreshed. But per above, even w/ incremental training, words untrained by incremental batches, compared to those retrained (or added), can *also* drift arbitrarily out of meaningful relative arrangements learned by earlier training. It's just more subtle, & highly dependent on the exact mix of training data & parameter choices. Beware assuming incremental training keeps things compatible/valid. – gojomo Sep 17 '20 at 06:42