I am using the gensim word2vec package in Python. I know how to get the vocabulary from the trained model, but how do I get the word count for each word in the vocabulary?
15
Does this answer your question? [gensim word2vec: Find number of words in vocabulary](https://stackoverflow.com/questions/35596031/gensim-word2vec-find-number-of-words-in-vocabulary) – Abu Shoeb Nov 23 '20 at 06:25
Since gensim v4, you should consider changing the accepted answer; @Makan provides a good update. – pietrodito Jun 19 '22 at 06:37
3 Answers
32
Each word in the vocabulary has an associated vocabulary object, which contains an index and a count.
vocab_obj = w2v.vocab["word"]
vocab_obj.count
Output for the Google News w2v model: 2998437
So to get the count for each word, you would iterate over all words and vocab objects in the vocabulary.
for word, vocab_obj in w2v.vocab.items():
    # Do something with vocab_obj.count, for example print it
    print(word, vocab_obj.count)
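Since the attribute layout changed between Gensim releases, a small wrapper can hide the difference. This is a minimal sketch, not part of Gensim: `get_word_count` is a hypothetical helper name, and it assumes the object is either a model with a `wv` attribute or a bare KeyedVectors-like object exposing `get_vecattr` (Gensim 4.x) or `vocab` (older releases).

```python
def get_word_count(model, word):
    """Return the stored count for `word` (hypothetical helper, not a Gensim API)."""
    # Gensim >= 1.0 keeps the vectors on `model.wv`; also accept a bare KeyedVectors.
    kv = getattr(model, "wv", model)
    if hasattr(kv, "get_vecattr"):
        # Gensim 4.x: per-word attributes are read through get_vecattr().
        return kv.get_vecattr(word, "count")
    # Older Gensim: each word maps to a vocab object with a `.count` field.
    return kv.vocab[word].count
```

With a trained model this would be called as `get_word_count(w2v, "word")`. It raises `KeyError` for words that are not in the vocabulary, the same as the direct lookups.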

user3390629
As of [`gensim` 1.0.0](https://github.com/RaRe-Technologies/gensim/releases/tag/1.0.0), you need to do `w2v.wv.vocab["word"].count` instead of `w2v.vocab["word"].count`. – Adam Liter Jun 21 '17 at 14:53
@aucamort would you explain a bit what you mean by `word count ≠ word frequency`? This answer (https://stackoverflow.com/a/55659539/6907424) seems to contradict what you have said. – hafiz031 Jul 11 '21 at 07:47
8
The `vocab` attribute was removed from `KeyedVectors` in Gensim 4.0.0.
Instead:
word2vec_model.wv.get_vecattr("my-word", "count") # returns count of "my-word"
len(word2vec_model.wv) # returns size of the vocabulary
Check out the notes on migrating from Gensim 3.x to 4.
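To get the count for every word rather than a single one, the same call can be applied across the whole vocabulary. The sketch below assumes Gensim 4.x, where `KeyedVectors.index_to_key` lists the vocabulary words; `vocab_counts` is a hypothetical helper name, not a Gensim function.

```python
def vocab_counts(kv):
    """Map every word in a Gensim 4.x KeyedVectors to its stored count.

    Hypothetical helper: `kv` is assumed to expose `index_to_key` and
    `get_vecattr`, as Gensim 4.x KeyedVectors do.
    """
    return {word: kv.get_vecattr(word, "count") for word in kv.index_to_key}
```

Usage would look like `counts = vocab_counts(word2vec_model.wv)`. One caveat, as far as I can tell from the Gensim source: vectors loaded with `load_word2vec_format` without a vocabulary file get placeholder counts derived from word rank, so the numbers are only real corpus frequencies when the model was trained in Gensim or loaded together with its counts.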

Makan
5
When you want to create a dictionary mapping each word to its count for easy retrieval later, you can do so as follows:
w2c = dict()
for item in model.wv.vocab:
    w2c[item] = model.wv.vocab[item].count
If you want to sort it to see the most frequent words in the model, you can do that as follows:
w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
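If only the top few words are needed, sorting the whole dictionary is not required. A minimal sketch using the standard library's `collections.Counter`, assuming `w2c` is a plain word-to-count dictionary like the one built above; `most_frequent` is a hypothetical helper name.

```python
from collections import Counter


def most_frequent(w2c, n=10):
    """Return the n (word, count) pairs with the highest counts.

    `w2c` is assumed to be a plain dict mapping word -> count.
    """
    # Counter.most_common(n) returns pairs ordered from highest to lowest count.
    return Counter(w2c).most_common(n)
```

For example, `most_frequent(w2c, 20)` would give the 20 most frequent words with their counts, independent of which Gensim version produced the dictionary.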

Ahmadov