
I am doing my research with a fastText pre-trained model, and I need word frequencies to do further analysis. Do the .vec or .bin files provided on the fastText website contain word-frequency information? If yes, how do I get it?

I am using load_word2vec_format to load the model, and I tried model.wv.vocab[word].count, but that only gives the word's frequency rank, not its original frequency.

qby pony

1 Answer


I don't believe those formats include any word frequency information.

To the extent any pre-trained word-vectors declare what they were trained on – like, say, Wikipedia text – you could go back to the training corpus (or some reasonable approximation) to perform your own frequency-count. Even if you've only got a "similar" corpus, the frequencies might be "close enough" for your analytical need.
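As a sketch of that approach, word counts can be tallied over any corpus text with Python's collections.Counter. The whitespace/lowercase tokenization here is an assumption for illustration; for estimates closer to the training counts, you'd want to reuse whatever tokenization the vectors' authors used:

```python
from collections import Counter

def corpus_frequencies(lines):
    """Count word occurrences across an iterable of text lines.

    Uses a naive lowercase whitespace split; swap in the same
    tokenizer the word-vectors were trained with if known.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# Toy "corpus" standing in for the real training text:
corpus = ["the cat sat on the mat", "the dog sat"]
freqs = corpus_frequencies(corpus)
print(freqs.most_common(2))  # [('the', 3), ('sat', 2)]
```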

Similarly, you could potentially use the frequency-rank to synthesize a dummy frequency table, using Zipf's Law, which roughly holds for normal natural-language corpora. Again, the relative proportions between words might be roughly close enough to the real proportions for your need, even without the real/precise frequencies that were used during word-vector training.

Combining the version of the Zipf's law formula on the Wikipedia page that uses the Harmonic number (H) in the denominator with the efficient approximation of H given in this answer, we can create a function that, given a word's rank (starting at 1) and the total number of unique words, gives the proportionate frequency predicted by Zipf's law:

from numpy import euler_gamma
from scipy.special import digamma

def digamma_H(s):
    """Harmonic number H(s), approximated via the digamma function.
    If s is complex the result becomes complex."""
    return digamma(s + 1) + euler_gamma

def zipf_at(k_rank, N_total):
    """Zipf-predicted proportionate frequency of the word at
    rank k_rank (1-based) in a vocabulary of N_total words."""
    return 1.0 / (k_rank * digamma_H(N_total))

Then, if you had a pretrained set of 1 million word-vectors, you could estimate the first word's frequency as:

>>> zipf_at(1, 1000000)
0.06947953777315177
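Extending that, a full dummy frequency table could be synthesized by walking the vocabulary in rank order. This sketch assumes you can obtain the words as a most-frequent-first list (fastText's files store vectors in that order); the toy word list below is just a stand-in:

```python
from numpy import euler_gamma
from scipy.special import digamma

def digamma_H(s):
    # Harmonic number H(s) via the digamma function.
    return digamma(s + 1) + euler_gamma

def zipf_at(k_rank, N_total):
    # Zipf-predicted proportionate frequency at 1-based rank k_rank.
    return 1.0 / (k_rank * digamma_H(N_total))

def zipf_frequency_table(words):
    """Map each word to its Zipf-predicted proportionate frequency,
    assuming `words` is ordered from most- to least-frequent."""
    n = len(words)
    return {w: zipf_at(rank, n) for rank, w in enumerate(words, start=1)}

# Toy rank-ordered vocabulary standing in for the model's word list:
table = zipf_frequency_table(['the', 'of', 'and', 'to'])
```

The predicted proportions decay as 1/rank and, by construction, sum to 1.0 over the whole vocabulary.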
gojomo
It's a pity that they didn't provide frequency info. Thank you for your answer and suggestions. I'll try this instead of going back to those TB corpora (which seems more realistic to me). Appreciated! – qby pony Nov 06 '19 at 19:58