I just came across this StackOverflow post on word counts in a doc2vec model vocabulary. I wonder if there is another method to retrieve the word frequency, other than

# Print each vocabulary word next to its corpus frequency
for word, vocab_obj in model.wv.vocab.items():
    print(word, vocab_obj.count)

Is there a more elegant way via the gensim library, e.g. to output the words and their frequencies to a txt file?

Christopher

1 Answer

Nope, that in-memory dictionary (model.wv.vocab) is where the counts are stored for lookup; any further choices about display or storage are up to the user's own code.

gojomo
  • Okay, thanks for the clarification. Do you have example snippets that sort and show the word counts? In any case, I would recommend this for a future gensim version. It might help to inspect the data and check whether cleaning (i.e. of outliers) worked as expected. – Christopher Jan 31 '18 at 08:31
  • By default the words are given their indexes in the vector array in decreasing frequency, so if you look at the list `model.wv.index2word`, the word-tokens will be in most-frequent to least-frequent order. Other than that, it depends on your specific need: it's pretty straightforward Python to check if a word is present, or sort by a count, etc. For example, it's a one-liner to convert the existing vocab dict to a standard Python `Counter` via a dict comprehension: `counts = Counter({word: vocab.count for (word, vocab) in model.wv.vocab.items()})`, which then has a `most_common()` method. – gojomo Feb 01 '18 at 04:43
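
A minimal sketch of the `Counter` approach from the comment above, extended to write the counts to a text file as the question asks. The helper names `vocab_counts` and `write_counts`, the tab-separated output format, and the `fake_vocab` stand-in are assumptions made for illustration; they are not part of gensim. The stand-in only mimics the shape of `model.wv.vocab` (word mapped to an object with a `.count` attribute) so the snippet runs without a trained model.

```python
import os
import tempfile
from collections import Counter
from types import SimpleNamespace


def vocab_counts(vocab):
    """Build a Counter from a mapping of word -> object with a .count attribute."""
    return Counter({word: entry.count for word, entry in vocab.items()})


def write_counts(counts, path):
    """Write one 'word<TAB>count' line per word, most frequent first."""
    with open(path, "w", encoding="utf-8") as handle:
        for word, count in counts.most_common():
            handle.write(f"{word}\t{count}\n")


# Hypothetical stand-in for model.wv.vocab (assumption, not real gensim data).
fake_vocab = {
    "the": SimpleNamespace(count=120),
    "model": SimpleNamespace(count=45),
    "vector": SimpleNamespace(count=7),
}

counts = vocab_counts(fake_vocab)
print(counts.most_common(2))

# Write to a temporary directory so the sketch does not touch the working dir.
out_path = os.path.join(tempfile.mkdtemp(), "word_counts.txt")
write_counts(counts, out_path)
```

With a real model of that era, the call would presumably be `vocab_counts(model.wv.vocab)`. In gensim 4.x the `vocab` dict was replaced, so the comprehension would likely need adapting, e.g. to iterate `model.wv.key_to_index` and read each count with `model.wv.get_vecattr(word, "count")`.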