How to get vocabulary word count from gensim word2vec?

Question

I am using the gensim word2vec package in Python. I know how to get the vocabulary from the trained model, but how do I get the word count for each word in the vocabulary?

Does this answer your question? [gensim word2vec: Find number of words in vocabulary](https://stackoverflow.com/questions/35596031/gensim-word2vec-find-number-of-words-in-vocabulary) — Abu Shoeb, Nov 23 '20 at 06:25
Since gensim v4, you should consider changing the accepted answer; @Makan provides a good update. — pietrodito, Jun 19 '22 at 06:37

score 32 · Accepted Answer · answered Jun 23 '16 at 15:05

Each word in the vocabulary has an associated vocabulary object, which contains an index and a count.

vocab_obj = w2v.vocab["word"]
vocab_obj.count

Output for the Google News w2v model: 2998437

So to get the count for each word, you would iterate over all words and vocab objects in the vocabulary.

for word, vocab_obj in w2v.vocab.items():
    # Do something with vocab_obj.count, for example print it
    print(word, vocab_obj.count)
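Since the attribute path has changed across gensim releases (see the comments and the later answer in this thread), a version-tolerant lookup might be sketched as follows. The helper name `get_word_count` is hypothetical, and the branching assumes only the three access patterns mentioned in this thread: `w2v.vocab`, `w2v.wv.vocab`, and `wv.get_vecattr`.

```python
def get_word_count(model, word):
    """Return the count stored for `word` (hypothetical helper).

    `model` may be a Word2Vec model or a KeyedVectors-like object.
    """
    # From gensim 1.0.0 on, Word2Vec models keep their vectors under `.wv`;
    # older models (or a bare vectors object) are used directly.
    kv = getattr(model, "wv", model)
    if hasattr(kv, "get_vecattr"):
        # gensim 4.0.0 and later
        return kv.get_vecattr(word, "count")
    # Before gensim 4.0.0: each vocab object carries a `.count`
    return kv.vocab[word].count
```

Calling `get_word_count(w2v, "word")` would then work regardless of which of those layouts the installed version uses.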

user3390629

As of [`gensim` 1.0.0](https://github.com/RaRe-Technologies/gensim/releases/tag/1.0.0), you need to do `w2v.wv.vocab["word"].count` instead of `w2v.vocab["word"].count`. – Adam Liter Jun 21 '17 at 14:53
Just to clarify, word count ≠ word frequency. – brienna Apr 22 '18 at 23:24
@aucamort would you explain what you mean by `word count ≠ word frequency`? This answer (https://stackoverflow.com/a/55659539/6907424) seems to contradict what you have said. – hafiz031 Jul 11 '21 at 07:47

score 8 · Answer 2 · answered Mar 16 '21 at 21:30

The vocab attribute was removed from KeyedVectors in Gensim 4.0.0.

Instead:

word2vec_model.wv.get_vecattr("my-word", "count")  # returns count of "my-word"
len(word2vec_model.wv)  # returns size of the vocabulary

Check out the notes on migrating from Gensim 3.x to 4.
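To get the count for every word at once under Gensim 4, one option is to walk the keyed vectors' `index_to_key` list and call `get_vecattr` for each entry. This is a minimal sketch; the function name `word_counts` is made up for illustration, and `wv` is assumed to be something like `word2vec_model.wv`.

```python
def word_counts(wv):
    """Map each word in a Gensim 4 KeyedVectors object to its stored count."""
    # index_to_key lists the vocabulary words in index order.
    return {word: wv.get_vecattr(word, "count") for word in wv.index_to_key}
```

For example, `counts = word_counts(word2vec_model.wv)` would give a plain dict that can be queried as `counts["my-word"]`.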

Makan

Upvoted, this is how to do it since v4 – pietrodito Jun 19 '22 at 06:40

score 5 · Answer 3 · answered Nov 09 '18 at 12:35

When you want to create a dictionary of word to count for easy retrieval later, you can do so as follows:

w2c = {}
for item in model.wv.vocab:  # wv.vocab is available from gensim 1.0.0 up to (not including) 4.0.0
    w2c[item] = model.wv.vocab[item].count

If you want to sort it to see the most frequent words in the model, you can do so as follows:

w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
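If only the top few words are needed, an alternative to sorting the whole dictionary is `collections.Counter` from the standard library. The sketch below assumes a word-to-count dict like `w2c` above; the helper name `top_words` and the toy counts are invented for illustration.

```python
from collections import Counter

def top_words(w2c, n=10):
    """Return the n (word, count) pairs with the highest counts."""
    return Counter(w2c).most_common(n)

# Toy counts standing in for a real model's vocabulary.
example = {"the": 120, "cat": 7, "sat": 3}
print(top_words(example, 2))  # [('the', 120), ('cat', 7)]
```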
