I'm building a chatbot so I need to vectorize the user's input using Word2Vec.

I'm using Google's pre-trained model with 3 million words (GoogleNews-vectors-negative300).

So I load the model using Gensim:

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

The problem is that it takes about 2 minutes to load the model. I can't let the user wait that long.

So what can I do to speed up the load time?

I thought about putting each of the 3 million words and their corresponding vector into a MongoDB database. That would certainly speed things up but intuition tells me it's not a good idea.

– Marcus Holm

4 Answers


In recent gensim versions you can load a subset starting from the front of the file using the optional limit parameter to load_word2vec_format(). (The GoogleNews vectors seem to be in roughly most- to least- frequent order, so the first N are usually the N-sized subset you'd want. So use limit=500000 to get the most-frequent 500,000 words' vectors – still a fairly large vocabulary – saving 5/6ths of the memory/load-time.)
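
For example, a minimal sketch of loading only the most frequent 500,000 vectors:

from gensim.models import KeyedVectors

# limit=500000 reads just the first 500k vectors from the file
# (roughly the most-frequent words, given the file's ordering)
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=500000)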

So that may help a bit. But if you're re-loading for every web request, you'll still be hurt by loading's IO-bound speed and the redundant memory overhead of storing each re-loaded copy.

There are some tricks you can use in combination to help.

Note that after loading such vectors in their original word2vec.c-originated format, you can re-save them using gensim's native save(). If you save them uncompressed, and the backing array is large enough (and the GoogleNews set is definitely large enough), the backing array gets dumped in a separate file in a raw binary format. That file can later be memory-mapped from disk, using gensim's native load(filename, mmap='r') option.

Initially, this will make the load seem snappy – rather than reading all the array from disk, the OS will just map virtual address regions to disk data, so that some time later, when code accesses those memory locations, the necessary ranges will be read-from-disk. So far so good!

However, if you are doing typical operations like most_similar(), you'll still face big lags, just a little later. That's because this operation requires both an initial scan-and-calculation over all the vectors (on first call, to create unit-length-normalized vectors for every word), and then another scan-and-calculation over all the normed vectors (on every call, to find the N-most-similar vectors). Those full-scan accesses will page-into-RAM the whole array – again costing the couple-of-minutes of disk IO.

What you want is to avoid redundantly doing that unit-normalization, and to pay the IO cost just once. That requires keeping the vectors in memory for re-use by all subsequent web requests (or even multiple parallel web requests). Fortunately memory-mapping can also help here, albeit with a few extra prep steps.

First, load the word2vec.c-format vectors, with load_word2vec_format(). Then, use model.init_sims(replace=True) to force the unit-normalization, destructively in-place (clobbering the non-normalized vectors).

Then, save the model to a new filename-prefix: model.save('GoogleNews-vectors-gensim-normed.bin'). (Note that this actually creates multiple files on disk that need to be kept together for the model to be re-loaded.)
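
Putting those one-time prep steps together, roughly:

from gensim.models import KeyedVectors

# One-time prep: load the original vectors, unit-normalize in place,
# then re-save in gensim's native format (creates several files)
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims(replace=True)  # clobbers raw vectors with unit-normed ones
model.save('GoogleNews-vectors-gensim-normed.bin')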

Now, we'll make a short Python program that serves to both memory-map load the vectors, and force the full array into memory. We also want this program to hang until externally terminated (keeping the mapping alive), and be careful not to re-calculate the already-normed vectors. This requires another trick because the loaded KeyedVectors actually don't know that the vectors are normed. (Usually only the raw vectors are saved, and normed versions re-calculated whenever needed.)

Roughly the following should work:

from gensim.models import KeyedVectors
from threading import Semaphore
model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0  # prevent recalc of normed vectors
model.most_similar('stuff')  # any word will do: just to page all in
Semaphore(0).acquire()  # just hang until process killed

This will still take a while, but only needs to be done once, before/outside any web requests. While the process is alive, the vectors stay mapped into memory. Further, unless/until there's other virtual-memory pressure, the vectors should stay loaded in memory. That's important for what's next.

Finally, in your web request-handling code, you can now just do the following:

model = KeyedVectors.load('GoogleNews-vectors-gensim-normed.bin', mmap='r')
model.syn0norm = model.syn0  # prevent recalc of normed vectors
# … plus whatever else you wanted to do with the model

Multiple processes can share read-only memory-mapped files. (That is, once the OS knows that file X is in RAM at a certain position, every other process that also wants a read-only mapped version of X will be directed to re-use that data, at that position.)

So this web-request load(), and any subsequent accesses, can all re-use the data that the prior process already brought into address space and active memory. Operations requiring similarity-calcs against every vector will still take the time to access multiple GB of RAM, and do the calculations/sorting, but will no longer require extra disk-IO and redundant re-normalization.

If the system is facing other memory pressure, ranges of the array may fall out of memory until the next read pages them back in. And if the machine lacks the RAM to ever fully load the vectors, then every scan will require a mix of paging in and out, and performance will be frustratingly bad no matter what. (In such a case: get more RAM or work with a smaller vector set.)

But if you do have enough RAM, this winds up making the original/natural load-and-use-directly code "just work" quite quickly, without an extra web service interface, because the machine's shared file-mapped memory functions as the service interface.

– gojomo
  • Is this method still valid? I see that init_sims() is deprecated – ciropom Jun 29 '23 at 08:34
  • In recent versions of Gensim, optimizations mean that simply loading any `KeyedVectors` with `KeyedVectors.load(filename, mmap='r')` should get essentially all the same memory-reuse benefits of this approach, *without* needing to do any other explicit norming, forced replacement, or manual patching of internal attributes (like `syn0`/`syn0norm`, which have also either been renamed or removed). So: just make sure your vectors are native-saved as `KeyedVectors`, then do the `mmap`-enabled load, without any other tampering. – gojomo Jun 29 '23 at 16:59

I really love vzhong's Embedding library. https://github.com/vzhong/embeddings

It stores word vectors in SQLite, which means we don't need to load the whole model, just fetch the corresponding vectors from the DB :D
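
Going by the project's README, usage looks roughly like this (class names and parameters may differ between versions):

from embeddings import GloveEmbedding

# Vectors are fetched lazily from a local SQLite database, so there is
# no multi-minute up-front model load
g = GloveEmbedding('common_crawl_840', d_emb=300)
print(g.emb('restaurant'))  # a list of 300 floats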

– Hyeungshik Jung

A method that worked for me:

from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('wikipedia-pubmed-and-PMC-w2v.bin', binary=True)
model.init_sims(replace=True)  # unit-normalize the vectors in place
model.save('bio_word')         # re-save in gensim's native format

Later, load the model:

model = Word2Vec.load('bio_word', mmap='r')

for more info: https://groups.google.com/forum/#!topic/gensim/OvWlxJOAsCo

– Nicomedes E.
  • Thanks for answering. To make it easier to follow, please format your answer and explain why it solves the problem. Please check [the guidelines on answering](https://stackoverflow.com/help/how-to-answer). – eenagy Jul 10 '19 at 05:03
  • Such a trivial solution to the problem! – Bendemann Aug 25 '20 at 20:13
  • In older versions of Gensim (<4.0) where `init_sims()` does something relevant, this *won't* work to ensure full sharing between processes, because each process creates its own redundant unshared unit-normed array (unless using tricks in another answer). Also, generally in both older & newer Gensim, if you're only reading pretrained vectors, `KeyedVectors` rather than a full `Word2Vec` model (w/ extra overhead) should be used. But in recent versions (4.0+), just loading saved `KeyedVectors` with the `.load(filename, mmap='r')` should work to achieve good sharing - no `.init_sims()`/etc needed. – gojomo Jun 29 '23 at 17:13

I have that problem whenever I use the Google News dataset. The issue is that there are way more words in the dataset than you'll ever need, including a huge number of typos and whatnot. What I do is scan the data I'm working on, build a dictionary of, say, the 50k most common words, get their vectors with Gensim, and save the dictionary. Loading this dictionary takes half a second instead of 2 minutes.

If you have no specific dataset, you could use the 50k or 100k most common words from a big dataset, such as a news dataset from WMT, to get you started.
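
A minimal sketch of that idea (the corpus path and output filename here are just illustrative):

import pickle
from collections import Counter
from gensim.models import KeyedVectors

# One-time step: keep only the vectors for the 50k most common words
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

counts = Counter()
with open('my_corpus.txt') as f:  # hypothetical corpus file
    for line in f:
        counts.update(line.split())

small = {w: model[w] for w, _ in counts.most_common(50000) if w in model}
with open('small_vectors.pkl', 'wb') as f:
    pickle.dump(small, f)

# Later: loads in a fraction of a second
with open('small_vectors.pkl', 'rb') as f:
    vectors = pickle.load(f)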

Another option is to keep Gensim running at all times. You can create a FIFO for a script running Gensim: the script acts like a "server", watching a file that a "client" writes vector requests to.
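
A bare-bones sketch of that server (the pipe paths and one-word-per-line protocol are illustrative choices, not a standard):

import json
import os
from gensim.models import KeyedVectors

REQ_PIPE = '/tmp/w2v_requests'   # hypothetical named pipes
RES_PIPE = '/tmp/w2v_responses'
for path in (REQ_PIPE, RES_PIPE):
    if not os.path.exists(path):
        os.mkfifo(path)

# Pay the ~2-minute load cost once, when the server starts
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

while True:
    # open() blocks until a client attaches to each pipe
    with open(REQ_PIPE) as requests, open(RES_PIPE, 'w') as responses:
        for word in requests:
            word = word.strip()
            vector = model[word].tolist() if word in model else None
            responses.write(json.dumps(vector) + '\n')
            responses.flush()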

I think the most elegant solution is to run a web service providing word embeddings. Check out the word2vec API as an example. After installing, getting the embedding for "restaurant" is as simple as:

curl http://127.0.0.1:5000/word2vec/model?word=restaurant
– vega
  • Perfect! Those are some great ideas and exactly the kind of solution I was looking for. Do you see any potential pitfalls with the database solution? If so I could always build a dictionary of the most common words (50-100k) as you suggested, but I'd rather have all 3 million of them if that's feasible. – Marcus Holm Mar 26 '17 at 02:33
  • Ok I just remembered the word2vec as web service solution. So it's a little like the FIFO concept but much better. You simply call the script with a web request. So you can have the 3 million words without putting them in a db, and the script is always running so no loading time. The repo even gives you links to a bunch of embeddings. – vega Mar 26 '17 at 11:38