Suppose I have the string:
"HMG-CoA reductase is a rate-limiting enzyme. HMG-CoA reductase is the primary enzyme in cholesterol synthesis."
I would like to compute the frequencies of tokens in the string. However, I want 'HMG-CoA reductase' to be one token (i.e., I don't want frequencies of the individual words 'HMG-CoA' and 'reductase').
I thought a good approach would be to create a list of bigrams:
[HMG-CoA reductase, reductase is, ..., cholesterol synthesis]
and trigrams:
[HMG-CoA reductase is, ..., in cholesterol synthesis]
I would then compute the frequency of each element in every n-gram list. If the frequency of a bigram or trigram came close to the frequency of its constituent unigrams (i.e., their ratio was above some threshold), I would know that the bigram or trigram, not the unigram, is my 'token'.
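To make the idea concrete, here is a minimal sketch for the bigram case in plain Python (no NLTK). The whitespace tokenization, the helper names `ngrams` and `candidate_tokens`, and the `threshold` and `min_count` values are all assumptions I made up for illustration, not a tested method.

```python
from collections import Counter

text = ("HMG-CoA reductase is a rate-limiting enzyme. "
        "HMG-CoA reductase is the primary enzyme in cholesterol synthesis.")


def ngrams(tokens, n):
    """Return all n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_tokens(bigram_freq, unigram_freq, threshold, min_count):
    """Keep bigrams whose count is close to the counts of both of their words.

    threshold and min_count are arbitrary cut-offs (hypothetical parameters).
    """
    candidates = []
    for (w1, w2), count in bigram_freq.items():
        if count < min_count:
            continue  # ignore bigrams that are too rare to judge
        # Ratio of the bigram count to the more frequent of its two words.
        ratio = count / max(unigram_freq[w1], unigram_freq[w2])
        if ratio >= threshold:
            candidates.append((w1, w2))
    return candidates


# Naive tokenization (an assumption): split on whitespace, strip periods.
# Sentence boundaries are ignored, so n-grams can span two sentences.
tokens = [w.strip(".") for w in text.split()]

unigram_freq = Counter(tokens)
bigram_freq = Counter(ngrams(tokens, 2))

candidates = candidate_tokens(bigram_freq, unigram_freq,
                              threshold=0.9, min_count=2)
print(candidates)
```

On this tiny sample the check keeps ('HMG-CoA', 'reductase') but also ('reductase', 'is'), which illustrates how fragile a hand-picked threshold is.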
I want to do this on a large amount of unstructured text data; the text itself is fairly standard. One problem with my approach is that I would have to set the threshold arbitrarily. Does NLTK already provide something for this problem, or does anyone know of a common approach?