5

I am using the gensim library to load pre-trained word vectors from the GoogleNews dataset. This dataset contains 3,000,000 word vectors, each of 300 dimensions. When I try to load the GoogleNews dataset, I receive a memory error. I have run this code before without a memory error and I don't know why I get this error now. I have checked a lot of sites for a solution but I still don't understand the problem. This is my code for loading GoogleNews:

import gensim.models.keyedvectors as word2vec
model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)

and this is the error I received:

File "/home/mahsa/PycharmProjects/tensor_env_project/word_embedding_DUC2007/inspect_word2vec-master/word_embeddings_GoogleNews.py", line 8, in <module>
    model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)
  File "/home/mahsa/anaconda3/envs/tensorflow_env/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 212, in load_word2vec_format
    result.syn0 = zeros((vocab_size, vector_size), dtype=datatype)
MemoryError

Can anybody help me? Thanks.

Mahsa
  • Are you using 32-bit Python or 64-bit Python? –  May 23 '18 at 00:06
  • I checked it with `import platform; platform.architecture()` and the result was 64-bit Python. – Mahsa May 23 '18 at 07:06
  • I updated the `gensim`, `numpy` and `scipy` packages and restarted my computer, and now I don't have that problem. But I don't know what happened. Thanks anyway. – Mahsa May 23 '18 at 08:36

4 Answers

7

Loading just the raw vectors will take...

3,000,000 words * 300 dimensions * 4 bytes/dimension = 3.6GB

...of addressable memory (plus some overhead for the word-key to index-position map).

Additionally, as soon as you want to do a most_similar()-type operation, unit-length normalized versions of the vectors will be created – which will require another 3.6GB. (If you'll only be doing cosine-similarity comparisons between the unit-normed vectors, you can instead clobber the raw vectors in place and save that extra memory by first making a forced, explicit call to model.init_sims(replace=True).)
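For example, a minimal sketch of that in-place normalization, assuming gensim 3.x where init_sims() is still available (it was deprecated in gensim 4.x):

import gensim.models.keyedvectors as word2vec

# load the full set of raw vectors (~3.6GB)
model = word2vec.KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# replace the raw vectors with unit-normed versions in place,
# so most_similar() does not allocate a second ~3.6GB array
model.init_sims(replace=True)

print(model.most_similar("computer", topn=5))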

So you'll generally only want to do full operations on a machine with at least 8GB of RAM. (Any swapping at all during full-array most_similar() lookups will make operations very slow.)

If anything else was using Python heap space, that could have accounted for the MemoryError you saw.

The load_word2vec_format() method also has an optional limit argument which will only load the supplied number of vectors – so you could use limit=500000 to cut the memory requirements by about 5/6ths. (And, since the GoogleNews and other vector sets are usually ordered from most- to least-frequent words, you'll get the 500K most-frequent words. Lower-frequency words generally have much less value and even not-as-good vectors, so it may not hurt much to ignore them.)
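For example, a hedged sketch of loading only the first 500,000 vectors (the filename and count are just the values from the question and the paragraph above):

import gensim.models.keyedvectors as word2vec

# load only the 500K most-frequent words, cutting the raw-vector
# memory from ~3.6GB to roughly 0.6GB
model = word2vec.KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, limit=500000)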

gojomo
  • Silly question, but I want to use this Google News model over various different files on my laptop. This means I will be running this line over and over again in different Jupyter notebooks: model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True) Does this eat 1) storage (I've noticed my storage filling up for no apparent reason) and 2) memory, if I close the previous notebook before running the next? – user8322222 Feb 27 '19 at 17:44
  • 1
    Just loading a model won't usually use any more disk storage. (An exception: if load or use needs addressable memory beyond your RAM, you may start using virtual memory, which might show up as less-disk space depending on your OS. But, with these sorts of models, you want to avoid using any virtual memory, as basic `most_similar()` ops cycle through the full model, & will be very slow.) Loading the model will use memory, then more when 1st doing `most_similar()` – but terminating a notebook should free that memory. (Note closing a tab may not cleanly terminate a Jupyter notebook.) – gojomo Feb 28 '19 at 01:16
  • Thanks for clarifying! – user8322222 Feb 28 '19 at 09:23
1

Loading the whole model needs more RAM.

You can use the following code. Set the limit to whatever your system can handle; it will load the vectors at the top of the file.

from gensim import models

w = models.KeyedVectors.load_word2vec_format(r"GoogleNews-vectors-negative300.bin.gz", binary=True, limit = 100000)

I set the limit to 100,000 and it worked on my laptop with 4GB of RAM.

0

Try closing all your browser tabs and anything else that is eating up RAM. That worked for me.

0

You should increase the RAM; then it would work.

  • Although you are correct, your answer would be a bit more helpful for the poster if you could clarify what you mean by "increase the RAM" and also explain more of what is happening. – Brian Ecker Apr 17 '20 at 06:25
  • Is there any way to derive another model from the pretrained word2vec model so we can predict similar items without heavy RAM usage? – Hassan Abbas Apr 18 '20 at 04:50