
I want to use Word2vec in a web server (production) in two different variants, where I fetch two sentences from the web and compare them in real time. For now, I am testing it on a local machine with 16 GB of RAM.

Scenario:

```
w2v = load w2v model

if condition 1 is true:
    if normalized:
        reverse the normalization via w2v.init_sims(replace=False)  # not sure if this works
    loop through some items:
        calculate their vectors using w2v
else if condition 2 is true:
    if not normalized:
        w2v.init_sims(replace=True)
    loop through some items:
        calculate their vectors using w2v
```

I have already read the suggestion about reducing the vocabulary to a smaller size, but I would like to use the full vocabulary.

Are there newer workarounds for handling this? Is there a way to load a small portion of the vocabulary initially, for the first 1-2 minutes, while the whole vocabulary keeps loading in parallel?

utengr
  • I don't know about your web server, but I am sure that in production you need to open the file only once and keep the vectors in memory, instead of reinitializing them on every new session. You don't have that in your code here, so I cannot guess what kinds of solutions there are. – Mehdi Jul 19 '17 at 12:50
  • I am already doing that with a global variable across the project where I set a flag when the model is initialized for the first time. However, I am more interested in how to reduce the loading time for the first time. – utengr Jul 19 '17 at 13:10
  • In this case the loading time for first time means when you start the server, right? – Mehdi Jul 19 '17 at 13:15
  • yes. At that point, it also becomes important to use the memory in an efficient way because this model takes a couple of GBs in memory. – utengr Jul 19 '17 at 13:22
  • I am not sure about this, but you can check easily: if you have float64, you can reduce the memory size by making it float32. It will reduce the accuracy, though. Reducing the vocabulary size is the best solution. Think of it this way: if a word appears only once in the corpus, it has a random initialization vector in word2vec. – Mehdi Jul 19 '17 at 13:34

1 Answer


As a one-time delay that you should be able to schedule to happen before any service-requests, I would recommend against worrying too much about the first-time load() time. (It's going to inherently take a lot of time to load a lot of data from disk to RAM – but once there, if it's being kept around and shared between processes well, the cost is not spent again for an arbitrarily long service-uptime.)

It doesn't make much sense to "load a small portion of the vocabulary for first 1-2 minutes and in parallel keep loading the whole vocabulary" – as soon as any similarity-calc is needed, the whole set of vectors needs to be accessed for any top-N results. (So the "half-loaded" state isn't very useful.)

Note that if you do init_sims(replace=True), the model's original raw vector magnitudes are clobbered with the new unit-normed (all-same-magnitude) vectors. So looking at your pseudocode, the only difference between the two paths is the explicit init_sims(replace=True). But if you're truly keeping the same shared model in memory between requests, as soon as condition 2 occurs, the model is normalized, and thereafter calls under condition 1 are also occurring with normalized vectors. And further, additional calls under condition 2 just redundantly (and expensively) re-normalize the vectors in-place. So if normalized-comparisons are your only focus, best to do one in-place init_sims(replace=True) at service startup - not at the mercy of order-of-requests.
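To make the in-place normalization concrete, here's a minimal sketch of what `init_sims(replace=True)` does to the vector array, using plain numpy (the `normalize_in_place` helper and the toy 2-vector array are illustrative stand-ins, not gensim code):

```python
import numpy as np

def normalize_in_place(vectors):
    """Unit-norm each row vector in place, as init_sims(replace=True) does.

    After this, every vector has magnitude 1.0, so cosine similarity
    reduces to a plain dot product - but the original raw magnitudes
    are gone for good.
    """
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against all-zero rows
    vectors /= norms
    return vectors

# toy stand-in for a loaded model's vector array
vecs = np.array([[3.0, 4.0], [1.0, 0.0]])
normalize_in_place(vecs)
print(np.linalg.norm(vecs, axis=1))  # every row now has length 1.0
```

Doing this once at startup, before serving requests, gives every request the same normalized view of the model, rather than leaving the model's state dependent on which condition happened to fire first.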

If you've saved the model using gensim's native save() (rather than save_word2vec_format()), and as uncompressed files, there's the option to 'memory-map' the files on a future re-load. This means rather than immediately copying the full vector array into RAM, the file-on-disk is simply marked as providing the addressing-space. There are two potential benefits to this: (1) if you only even access some limited ranges of the array, only those are loaded, on demand; (2) many separate processes all using the same mapped files will automatically reuse any shared ranges loaded into RAM, rather than potentially duplicating the same data.

(1) isn't much of an advantage as soon as you need to do a full sweep over the whole vocabulary – because they're all brought into RAM then, and furthermore at the moment of access (which will have more service-lag than if you'd just pre-loaded them). But (2) is still an advantage in multi-process webserver scenarios. There's a lot more detail on how you might use memory-mapped word2vec models efficiently in a prior answer of mine, at How to speed up Gensim Word2vec model load time?
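The underlying mechanism can be illustrated with numpy alone – gensim's native `save()`/`load(..., mmap='r')` relies on the same OS facility, with `np.save`/`np.load(mmap_mode='r')` standing in for it here (the file path and array shape below are made up for the demo):

```python
import os
import tempfile
import numpy as np

# Save an uncompressed array to disk, analogous to the raw .npy files
# that gensim's native save() writes alongside the model.
path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(path, np.arange(1000 * 50, dtype=np.float32).reshape(1000, 50))

# Memory-map it: the file on disk backs the address space, pages are
# loaded lazily on first access, and multiple processes mapping the
# same file share those pages in RAM instead of duplicating them.
vecs = np.load(path, mmap_mode="r")
print(vecs.shape)    # shape metadata is available immediately
print(vecs[42, :3])  # only the touched pages are faulted in
```

A full-vocabulary similarity sweep would still touch every page eventually, but the shared-pages benefit for multiple worker processes remains.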

gojomo
  • It was more of a code mistake; I have adapted the code. I am still not sure if I can reverse the normalization, because I want to have both a normalized and a non-normalized model. I could initialize two different models, but that is very inefficient. – utengr Jul 24 '17 at 09:54
  • *Unless* you use the non-default `replace=True`, the single model will retain both raw and normalized vectors... but cosine-similarities (by definition) only operate on the unit-normalized vectors. – gojomo Jul 24 '17 at 19:32