15

I am using the gensim word2vec package in Python. I know how to get the vocabulary from the trained model, but how do I get the word count for each word in the vocabulary?

Michelle Owen
  • 361
  • 1
  • 3
  • 11
  • Does this answer your question? [gensim word2vec: Find number of words in vocabulary](https://stackoverflow.com/questions/35596031/gensim-word2vec-find-number-of-words-in-vocabulary) – Abu Shoeb Nov 23 '20 at 06:25
  • Since gensim v4, you should consider changing the accepted answer; @Makan provides a good update. – pietrodito Jun 19 '22 at 06:37

3 Answers

32

Each word in the vocabulary has an associated vocabulary object, which contains an index and a count.

vocab_obj = w2v.vocab["word"]
vocab_obj.count

Output for the Google News w2v model: 2998437

So to get the count for each word, you would iterate over all words and vocab objects in the vocabulary.

for word, vocab_obj in w2v.vocab.items():
    # Do something with vocab_obj.count, e.g. print it
    print(word, vocab_obj.count)
user3390629
  • 854
  • 10
  • 17
  • 16
    As of [`gensim` 1.0.0](https://github.com/RaRe-Technologies/gensim/releases/tag/1.0.0), you need to do `w2v.wv.vocab["word"].count` instead of `w2v.vocab["word"].count`. – Adam Liter Jun 21 '17 at 14:53
  • 2
    Just to clarify, word count ≠ word frequency. – brienna Apr 22 '18 at 23:24
  • @aucamort would you explain what you mean by `word count ≠ word frequency`? This answer (https://stackoverflow.com/a/55659539/6907424) seems to run counter to what you have said. – hafiz031 Jul 11 '21 at 07:47
8

The vocab attribute was removed from KeyedVectors in Gensim 4.0.0.

Instead:

word2vec_model.wv.get_vecattr("my-word", "count")  # returns count of "my-word"
len(word2vec_model.wv)  # returns size of the vocabulary

See the notes on migrating from Gensim 3.x to 4 for details.
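To get the count for every word at once under Gensim 4, the two calls above can be combined. The following is only a minimal sketch: the helper name `word_counts` is made up for illustration, and it assumes `kv` is a Gensim 4 `KeyedVectors` object (such as `word2vec_model.wv`) that exposes `index_to_key` and `get_vecattr`.

```python
from collections import Counter


def word_counts(kv):
    """Map each vocabulary word to its stored count.

    `kv` is assumed to be a Gensim 4 KeyedVectors object, e.g. model.wv.
    `word_counts` is a hypothetical helper, not part of Gensim.
    """
    return Counter(
        {word: kv.get_vecattr(word, "count") for word in kv.index_to_key}
    )


# Usage, assuming a trained model named word2vec_model:
#   counts = word_counts(word2vec_model.wv)
#   counts["my-word"]        # count of a single word
#   counts.most_common(10)   # ten most frequent words with their counts
```

Returning a `Counter` rather than a plain dict keeps lookups the same while making `most_common()` available for sorting by count.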

Makan
  • 547
  • 6
  • 8
5

When you want to create a dictionary of word to count for easy retrieval later, you can do so as follows:

# Gensim 3.x API (wv.vocab was removed in Gensim 4.0.0)
w2c = {word: vocab_obj.count for word, vocab_obj in model.wv.vocab.items()}

If you want to sort it to see the most frequent words in the model, you can do so as follows:

w2cSorted = dict(sorted(w2c.items(), key=lambda x: x[1], reverse=True))
Ahmadov
  • 1,567
  • 5
  • 31
  • 48