There's no official support for removing words from a Gensim `Word2Vec` model, once they've ever "made the cut" for inclusion.
Even the ability to add words isn't on a great footing, as the feature isn't based on any proven/published method of updating a `Word2Vec` model, and glosses over difficult tradeoffs in how update-batches affect the model, via choice of learning-rate or whether the batches fully represent the existing vocabulary. The safest course is to regularly re-train the model from scratch, using a full corpus with sufficient examples of all relevant words.
So, my main suggestion would be to regularly replace your model with a new one trained with all still-relevant data. That would ensure it's no longer wasting model state on obsolete terms, and that all still-live terms have received coequal, interleaved training.
After such a reset, word-vectors won't be comparable to word-vectors from a prior 'model era'. (The same word, even if its tangible meaning hasn't changed, could be in an arbitrarily different place - but the relative relationships with other vectors should remain as good or better.) But, that same sort of drift-out-of-comparison is also happening with any set of small-batch updates that don't 'touch' every existing word equally, just at some unquantifiable rate.
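If you go the full-retrain route, that's just training a fresh model on the complete, still-relevant corpus. A minimal sketch, assuming a hypothetical `full_corpus` iterable of tokenized sentences and purely illustrative parameter values, might look like:

```python
from gensim.models import Word2Vec

# full_corpus: hypothetical iterable of tokenized sentences covering all still-relevant text
new_model = Word2Vec(
    sentences=full_corpus,
    size=100,        # becomes `vector_size` in gensim 4.0.0
    window=5,
    min_count=5,
    workers=4,
)
```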
OTOH, if you think you need to stay with such incremental updates, even knowing the caveats, it's plausible that you could patch up the model structures to retain as much as is sensible from the old model & continue training.
Your code so far is a reasonable start, missing a few important considerations for proper functionality:
- Because deleting earlier-words changes the index location of later-words, you'd need to update the `vocab[word].index` values for every surviving word, to match the new `index2word` ordering. For example, after doing all deletions, you might do:

```python
for i, word in enumerate(m.wv.index2word):
    m.wv.vocab[word].index = i
```
- Because your (default negative-sampling) `Word2Vec` model also has another array of per-word weights, related to the model's output layer, that array should be updated in sync as well, so that the right output-values are being checked per word. Roughly, whenever you delete a row from `m.wv.vectors`, you should delete the same row from `m.trainables.syn1neg`. (A batched way of doing both deletions, one call per array, is shown in the snippet near the end of this answer.)
- Because the surviving vocabulary has different relative word-frequencies, both the negative-sampling and downsampling (controlled by the `sample` parameter) functions should work off different pre-calculated structures to assist their choices. For the cumulative-distribution table used by negative-sampling, this is pretty easy:

```python
m.make_cum_table(m.wv)
```

For the downsampling, you'd want to update the `.sample_int` values similar to the logic you can view around the code at https://github.com/RaRe-Technologies/gensim/blob/3.8.3/gensim/models/word2vec.py#L1534; a rough sketch of that recalculation appears just after this list. (But, looking at that code now, I think it may be buggy in that it's updating all words with just the frequency info in the new dict, so probably fouling the usual downsampling of truly-frequent words, and possibly erroneously downsampling words that are only frequent in the new update.)
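As a rough illustration of that `.sample_int` recalculation, here's a minimal sketch that mirrors the word-probability formula used in gensim 3.8.x's `prepare_vocab()`, applied to only the surviving words' counts. The model name `m`, the exact attribute locations, and a fractional `sample` value are assumptions; this is illustrative, not an officially supported recipe:

```python
from math import sqrt

# Recompute each surviving word's downsampling threshold from the surviving counts only,
# mirroring the word_probability / sample_int logic in gensim 3.8.x prepare_vocab().
retain_total = sum(v.count for v in m.wv.vocab.values())
threshold_count = m.vocabulary.sample * retain_total   # assumes 0 < sample < 1.0

for v in m.wv.vocab.values():
    word_probability = (sqrt(v.count / threshold_count) + 1) * (threshold_count / v.count)
    # words rarer than the threshold get probability clamped to 1.0, i.e. never downsampled
    v.sample_int = int(round(min(word_probability, 1.0) * 2**32))
```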
If those internal structures are updated properly in sync with your existing actions, the model is probably in a consistent state for further training. (But note: these structures change a lot in the forthcoming `gensim-4.0.0` release, so any custom tampering will need to be updated when upgrading.)
One other efficiency note: the `np.delete()` operation will create a new array, the full size of the surviving array, and copy the old values over, each time it is called. So using it to remove many rows, one at a time, from a very-large original array is likely to require a lot of redundant allocation/copying/garbage-collection. You may be able to call it once, at the end, with a list of all indexes to remove.
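For example, a single batched deletion from both affected arrays might look like the following, where `words_to_trim` is a hypothetical list of the words being removed, and the attribute layout assumed is gensim 3.8.x's:

```python
import numpy as np

# words_to_trim: hypothetical list of obsolete words being removed
goner_indexes = sorted(m.wv.vocab[w].index for w in words_to_trim)

# one np.delete() call per array, removing the same rows from the input-vector
# array and the output-layer (negative-sampling) weights, kept in sync
m.wv.vectors = np.delete(m.wv.vectors, goner_indexes, axis=0)
m.trainables.syn1neg = np.delete(m.trainables.syn1neg, goner_indexes, axis=0)
```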
But really: the simpler & better-grounded approach, which may also yield significantly better continually-comparable vectors, would be to retrain with all current data whenever possible, or at least whenever a large amount of change has occurred.