Those answers are correct for reading the declared token-counts out of a model which has them.
But in some cases, your model may only have been initialized with fake, descending-by-1 counts for each word. When using Gensim, this is most likely if the model was loaded from a source where the counts either weren't available or weren't used.
In particular, if you created the model using `load_word2vec_format()`, that simple vectors-only format (whether binary or plain-text) inherently contains no word counts. But the words in such files are almost always, by convention, sorted in most-frequent to least-frequent order.
So, when frequencies are not present, Gensim has chosen to synthesize fake counts with linearly descending integer values: the (first) most-frequent word gets a count equal to the number of unique words, and the (last) least-frequent word gets a count of 1. (I'm not sure this is a good idea, but Gensim has been doing it for a while, and it ensures code relying on the per-token count won't break, and preserves the original order, though obviously not the unknowable original true proportions.)
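For example, here's a minimal sketch (with a hypothetical filename) of loading a vectors-only file and inspecting the synthesized counts; in Gensim 4.x, the per-word count is exposed via `get_vecattr()`:

```python
from gensim.models import KeyedVectors

# hypothetical vectors-only file; no frequencies are available inside it
kv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

first_word = kv.index_to_key[0]   # most-frequent, by convention
last_word = kv.index_to_key[-1]   # least-frequent

# with synthesized counts, these print len(kv) and 1 respectively
print(kv.get_vecattr(first_word, 'count'))
print(kv.get_vecattr(last_word, 'count'))
```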
In some cases, the original source of the file may have saved a separate `.vocab` file with the word-frequencies alongside the word2vec_format vectors. (In Google's original `word2vec.c` code release, this is the file generated by the optional `-save-vocab` flag. In Gensim's `.save_word2vec_format()` method, the optional `fvocab` parameter can be used to generate this side file.)
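A small sketch of generating that side file from Gensim (the filenames and toy corpus here are just placeholders):

```python
from gensim.models import Word2Vec

# toy corpus; in practice this would be your real tokenized sentences
sentences = [['hello', 'world'], ['hello', 'gensim'], ['word', 'vectors']]
model = Word2Vec(sentences, vector_size=10, min_count=1)

# write the vectors, plus a side 'vocab' file holding the true word-frequencies
model.wv.save_word2vec_format('vectors.txt', fvocab='vectors.vocab', binary=False)
```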
If so, that 'vocab' frequencies filename may be supplied, when you call `.load_word2vec_format()`, as the `fvocab` parameter - and then your vector-set will have true counts.
If your word-vectors were originally created in Gensim from a corpus giving actual frequencies, and were always saved/loaded using the Gensim native `.save()`/`.load()` methods (which use an extended form of Python-pickling), then the original true count info will never have been lost.
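That is, roughly (again with placeholder filenames and a toy corpus):

```python
from gensim.models import Word2Vec

sentences = [['hello', 'world'], ['hello', 'gensim'], ['word', 'vectors']]
model = Word2Vec(sentences, vector_size=10, min_count=1)
model.save('w2v.model')  # native save: true counts survive

reloaded = Word2Vec.load('w2v.model')
print(reloaded.wv.get_vecattr('hello', 'count'))  # same true count as at training time
```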
If you've lost the original frequency data, but you know the data was from a real natural-language source, and you want a more realistic (but still faked) set of frequencies, an option could be to use the Zipfian distribution. (Real natural-language usage frequencies tend to roughly fit this 'tall head, long tail' distribution.) A formula for creating such more-realistic dummy counts is available in the answer:
Gensim: Any chance to get word frequency in Word2Vec format?
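As a rough sketch of that idea (not necessarily the exact formula in that linked answer), you could assign each word a count proportional to the inverse of its rank:

```python
# assumes kv is a KeyedVectors whose words are already in most-frequent-first order
SCALE = 1_000_000  # arbitrary; only the relative proportions matter

for rank, word in enumerate(kv.index_to_key):
    # Zipf's law: frequency roughly proportional to 1 / (rank + 1)
    kv.set_vecattr(word, 'count', max(1, int(SCALE / (rank + 1))))
```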