gensim word2vec: Find number of words in vocabulary

Question

After training a word2vec model using python gensim, how do you find the number of words in the model's vocabulary?

gojomo · Accepted Answer · 2021-10-20T17:52:46.863

In recent versions, the model.wv property holds the words and their vectors, and can itself report a length: the number of words it contains. So if w2v_model is your Word2Vec (or Doc2Vec or FastText) model, it's enough to do:

vocab_len = len(w2v_model.wv)

If your model is just a raw set of word-vectors, like a KeyedVectors instance rather than a full Word2Vec/etc model, it's just:

vocab_len = len(kv_model)

Other useful internals in Gensim 4.0+ include model.wv.index_to_key, a plain list of the key (word) in each index position, and model.wv.key_to_index, a plain dict mapping keys (words) to their index positions.
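
To make the relationship between those two structures concrete, here is a minimal sketch using plain Python stand-ins rather than a real Gensim model. The three example words are made up, and the sketch assumes only what is described above: index_to_key is a list, key_to_index is a dict, and each is the inverse of the other.

```python
# Stand-ins for model.wv.index_to_key and model.wv.key_to_index.
# These are plain Python objects for illustration, not Gensim objects.
index_to_key = ["the", "army", "africa"]  # index position -> word
key_to_index = {word: i for i, word in enumerate(index_to_key)}  # word -> index position

# Both structures hold one entry per vocabulary word,
# so either length equals the vocabulary size.
vocab_len = len(index_to_key)

# Looking a word up and mapping its index back returns the same word.
round_trip = index_to_key[key_to_index["army"]]
```

On a real model the same lookups would go through model.wv.index_to_key and model.wv.key_to_index.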

In pre-4.0 versions, the vocabulary was in the vocab field of the Word2Vec model's wv property, as a dictionary, with the keys being each token (word). So there it was just the usual Python for getting a dictionary's length:

len(w2v_model.wv.vocab)

In very-old gensim versions before 0.13 vocab appeared directly on the model. So way back then you would use w2v_model.vocab instead of w2v_model.wv.vocab.

But if you're still using anything from before Gensim 4.0, you should definitely upgrade! There are big memory & performance improvements, and the changes required in calling code are relatively small – some renamings & moves, covered in the 4.0 Migration Notes.
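
If the same code has to run against several Gensim versions in the meantime, the three locations described above can be checked in order. The following is only a sketch: vocab_size is a hypothetical helper name, not part of Gensim, and it relies on attribute lookups alone, so it has not been verified against every old release.

```python
def vocab_size(model):
    """Return the number of words in a Gensim-style model (hypothetical helper)."""
    # Full models (Word2Vec, Doc2Vec, FastText) expose their vectors on .wv;
    # a bare KeyedVectors-like object, or a very old model, is used directly.
    kv = getattr(model, "wv", model)

    # Gensim 4.0+: key_to_index maps each word to its index position.
    key_to_index = getattr(kv, "key_to_index", None)
    if key_to_index is not None:
        return len(key_to_index)

    # Pre-4.0: the vocabulary is a dict stored in the vocab attribute.
    return len(kv.vocab)
```

With a real model this would be called as vocab_size(w2v_model) or vocab_size(kv_model).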

Indeed see the Gensim 4 migration notes: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#4-vocab-dict-became-key_to_index-for-looking-up-a-keys-integer-index-or-get_vecattr-and-set_vecattr-for-other-per-key-attributes — gojomo, Jan 28 '21 at 00:59
The vocab attribute was removed from KeyedVector in Gensim 4.0.0. Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead. — Tom J, Nov 24 '22 at 00:06

score 8 · Answer 2 · answered Feb 26 '19 at 18:27

One more way to get the vocabulary size is from the embedding matrix itself as in:

In [33]: from gensim.models import Word2Vec

# load the pretrained model
In [34]: model = Word2Vec.load(pretrained_model)

# get the shape of embedding matrix    
In [35]: model.wv.vectors.shape
Out[35]: (662109, 300)

# `vocabulary_size` is just the number of rows (i.e. axis 0)
In [36]: model.wv.vectors.shape[0]
Out[36]: 662109

score 5 · Answer 3 · answered Jun 22 '21 at 12:22

The older len(w2v_model.wv.vocab) approach from Gojomo's answer raises an AttributeError in Gensim 4.0.0+.

For these versions, you can get the length of the vocabulary as follows:

len(w2v_model.wv.index_to_key)

(which is slightly faster than: len(w2v_model.wv.key_to_index))

Emil

Vasanth Rohith · Answer 4 · 2023-05-23T06:04:26.747

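In Gensim 4.0.0+, accessing vocab fails with "AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0." As the error message says, use the .key_to_index dict or the .index_to_key list instead:

model = Word2Vec(text, min_count=1)

words = model.wv.index_to_key  # instead of model.wv.vocab

Here words is a list of every word in the vocabulary, so len(words) is the number of words. For example, if the vocabulary were army, africa, agreement, the count would be 3.

Use model.wv[word], for example model.wv["army"], to see the vector (the array of coordinates) for a word.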

Hope this helps...

Snehil · Answer 5 · 2022-07-26T16:47:23.040

Latest:

After creating the Gensim model, use model.wv.key_to_index; its length, len(model.wv.key_to_index), is the number of words in the vocabulary.

The vocab dict became key_to_index for looking up a key's integer index, or get_vecattr() and set_vecattr() for other per-key attributes: https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#4-vocab-dict-became-key_to_index-for-looking-up-a-keys-integer-index-or-get_vecattr-and-set_vecattr-for-other-per-key-attributes
