Extract token frequencies from gensim model

Question

Questions like 1 and 2 give answers for retrieving vocabulary frequencies from gensim word2vec models.

For some reason, they actually just give a decrementing counter from n (the size of the vocab) down to 1, alongside the tokens ordered from most to least frequent.

For example:

for idx, w in enumerate(model.vocab):
    print(idx, w, model.vocab[w].count)

Gives:

0 </s> 111051
1 . 111050
2 , 111049
3 the 111048
4 of 111047
...
111050 tokiwa 2
111051 muzorewa 1

Why is it doing this? How can I extract term frequencies from the model, given a word?

Shouldn't you write "model.wv.vocab"? – Peyman, Oct 01 '20 at 09:36

gojomo · Accepted Answer · 2020-10-01T17:23:33.090

Those answers are correct for reading the declared token-counts out of a model which has them.

But in some cases, your model may only have been initialized with a fake, descending-by-1 count for each word. With Gensim, this most likely happens when the model was loaded from a source where the counts either weren't available or weren't used.

In particular, if you created the model using load_word2vec_format(), that simple vectors-only format (whether binary or plain-text) inherently contains no word counts. But such words are almost always, by convention, sorted in most-frequent to least-frequent order.

So, Gensim has chosen, when frequencies are not present, to synthesize fake counts, with linearly descending int values, where the (first) most-frequent word begins with the count of all unique words, and the (last) least-frequent word has a count of 1.

(I'm not sure this is a good idea, but Gensim's been doing it for a while, and it ensures code relying on the per-token count won't break, and will preserve the original order, though obviously not the unknowable original true-proportions.)
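To make the scheme concrete, here is a minimal plain-Python sketch of how such placeholder counts could be produced. It is not Gensim's actual code; the function name synthesize_counts is made up for illustration, and it only assumes the words arrive in file order (most-frequent first).

```python
def synthesize_counts(words_in_file_order):
    """Assign linearly descending placeholder counts.

    Hypothetical helper, not part of Gensim: the first word gets a count
    equal to the number of unique words, the last word gets a count of 1.
    """
    n = len(words_in_file_order)
    return {word: n - index for index, word in enumerate(words_in_file_order)}


fake = synthesize_counts(["</s>", ".", ",", "the", "of"])
print(fake["</s>"], fake["of"])  # prints: 5 1
```

Counts produced this way preserve rank order only. The ratio between any two counts says nothing about how often the words really occurred.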

In some cases, the original source of the file may have saved a separate .vocab file with the word-frequencies alongside the word2vec_format vectors. (In Google's original word2vec.c code release, this is the file generated by the optional -save-vocab flag. In Gensim's .save_word2vec_format() method, the optional fvocab parameter can be used to generate this side file.)

If so, that 'vocab' frequencies filename may be supplied, when you call .load_word2vec_format(), as the fvocab parameter - and then your vector-set will have true counts.
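A rough sketch of what that could look like follows. The helper names and file paths are placeholders, and the count lookup differs between Gensim versions (as far as I know, .vocab[word].count in 3.x and .get_vecattr(word, 'count') in 4.x), so treat the version check as an assumption to verify against your install. The first helper reads the side file directly, assuming the usual one "word count" pair per line.

```python
def read_vocab_counts(vocab_path):
    """Parse a vocab side file ('word count' per line) into a dict.

    Hypothetical helper: useful if you only need the frequencies and
    not the vectors themselves.
    """
    counts = {}
    with open(vocab_path, encoding="utf-8") as fin:
        for line in fin:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            word, count = parts
            counts[word] = int(count)
    return counts


def load_vectors_with_counts(vectors_path, vocab_path, binary=False):
    """Load word2vec-format vectors plus true counts from the side file."""
    # Imported lazily so the other helpers work without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(
        vectors_path, fvocab=vocab_path, binary=binary
    )


def word_count(kv, word):
    """Return the stored count for a word, on Gensim 4.x or 3.x."""
    if hasattr(kv, "get_vecattr"):  # Gensim 4.x style
        return kv.get_vecattr(word, "count")
    return kv.vocab[word].count  # Gensim 3.x style
```

Usage would be along the lines of kv = load_vectors_with_counts("vectors.txt", "vectors.vocab") followed by word_count(kv, "the"), with both filenames standing in for your own files.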

If your word-vectors were originally created in Gensim from a corpus giving actual frequencies, and were always saved/loaded using the Gensim native functions .save()/.load() which use an extended form of Python-pickling, then the original true count info will never have been lost.

If you've lost the original frequency data, but you know the data was from a real natural-language source, and you want a more realistic (but still faked) set of frequencies, an option could be to use the Zipfian distribution. (Real natural-language usage frequencies tend to roughly fit this 'tall head, long tail' distribution.) A formula for creating such more-realistic dummy counts is available in the answer:

Gensim: Any chance to get word frequency in Word2Vec format?
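As a rough illustration of the idea (and not necessarily the formula from that linked answer), a plain Zipf assignment could look like the sketch below. The function name zipf_counts, the default total_tokens and the exponent of 1.0 are all assumptions chosen for the example; the words are assumed to already be sorted most-frequent first.

```python
def zipf_counts(words_by_rank, total_tokens=1_000_000, exponent=1.0):
    """Assign fake Zipfian counts to words sorted most-frequent first.

    Hypothetical helper: the count for rank r is proportional to
    1 / r**exponent, scaled so the counts sum to roughly total_tokens.
    Every word gets a count of at least 1.
    """
    weights = [1.0 / (rank ** exponent)
               for rank in range(1, len(words_by_rank) + 1)]
    norm = sum(weights)
    return {
        word: max(1, round(total_tokens * weight / norm))
        for word, weight in zip(words_by_rank, weights)
    }


counts = zipf_counts(["the", "of", "and", "tokiwa"], total_tokens=1000)
print(counts)
```

These are still invented numbers. They only give downstream code a more plausible 'tall head, long tail' shape than the linear descending counts.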
