Suppose I have the string:

"HMG-CoA reductase is a rate-limiting enzyme. HMG-CoA reductase is the primary enzyme in cholesterol synthesis."

I would like to compute the frequencies of the tokens in this string. However, I want 'HMG-CoA reductase' to count as one token (i.e., I don't want separate frequencies for the individual words 'HMG-CoA' and 'reductase').

I thought a good approach would be to create a list of bigrams:

[HMG-CoA reductase, reductase is, ..., cholesterol synthesis]

and trigrams

[HMG-CoA reductase is, ..., in cholesterol synthesis]

I would then compute the frequency of each element in each list of n-grams. If the frequency of a bigram or trigram approached the frequency of its component unigrams (that is, if their ratio was above some threshold), I would treat the bigram or trigram, rather than the unigram, as my 'token'.
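As a rough sketch of the comparison I have in mind, using only the standard library: the whitespace tokenization, the function names, the `THRESHOLD` of 0.9, and the `MIN_COUNT` of 2 are all placeholders I made up for illustration, not values I have any justification for.

```python
from collections import Counter

text = ("HMG-CoA reductase is a rate-limiting enzyme. "
        "HMG-CoA reductase is the primary enzyme in cholesterol synthesis.")

# Naive tokenization (an assumption): split on whitespace and strip periods.
# Sentence boundaries are ignored, so some n-grams span two sentences.
words = [w.strip(".") for w in text.split()]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


unigram_counts = Counter(words)
bigram_counts = Counter(ngrams(words, 2))

THRESHOLD = 0.9  # arbitrary cutoff; this is the value I don't know how to choose
MIN_COUNT = 2    # arbitrary; ignores bigrams seen only once


def candidate_bigrams(bigrams, unigrams, threshold, min_count):
    """Keep bigrams whose count is close to the count of their more frequent word."""
    kept = []
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        ratio = count / max(unigrams[first], unigrams[second])
        if ratio >= threshold:
            kept.append((first, second))
    return kept


candidates = candidate_bigrams(bigram_counts, unigram_counts, THRESHOLD, MIN_COUNT)
print(candidates)
```

On this two-sentence sample the sketch keeps ('HMG-CoA', 'reductase'), but it also keeps ('reductase', 'is'), because 'is' happens to follow 'reductase' both times. I would hope that more text separates the two cases, but the result still depends on the cutoff.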

I want to do this on a large amount of unstructured text data. It's fairly standard text. One problem with my approach is that I would need to set the threshold arbitrarily. Does NLTK already provide something for solving this problem, or does anyone know of a common approach?

Sam Weisenthal
