Create tokens from common bigrams/trigrams

Question

Suppose I have the string:

"HMG-CoA reductase is a rate-limiting enzyme. HMG-CoA reductase is the primary enzyme in cholesterol synthesis."

I would like to compute the frequencies of tokens in the string. However, I want 'HMG-CoA reductase' to be one token (i.e., I don't want frequencies of the individual words 'HMG-CoA' and 'reductase').

I thought a good approach would be to create a list of bigrams:

[HMG-CoA reductase, reductase is, ..., cholesterol synthesis]

and trigrams

[HMG-CoA reductase is, ..., in cholesterol synthesis]

I would then compute the frequency of each element in each n-gram list. If the frequency of a bigram or trigram approaches the frequency of its component words in the unigram list (i.e., the ratio is above some threshold), I would know that the bigram or trigram, not the unigram, is my 'token'.
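To make the idea concrete, here is a minimal sketch of that comparison using only the standard library. The function names (`ngrams`, `merge_candidates`), the naive whitespace tokenization, the `threshold=0.9` and `min_count=2` defaults, and the choice to compare against the most frequent component word are all my own assumptions for illustration, not an established method.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def merge_candidates(text, n=2, threshold=0.9, min_count=2):
    """Return n-grams whose count is close to the count of their words.

    Assumptions (hypothetical, for illustration only):
    - tokens are split on whitespace, with edge punctuation stripped,
      so n-grams may cross sentence boundaries;
    - the score is count(n-gram) / count(most frequent component word);
    - n-grams seen fewer than min_count times are ignored.
    """
    tokens = [t.strip(".,;:!?") for t in text.split()]
    tokens = [t for t in tokens if t]
    unigram_counts = Counter(tokens)
    ngram_counts = Counter(ngrams(tokens, n))

    candidates = {}
    for gram, count in ngram_counts.items():
        if count < min_count:
            continue
        ratio = count / max(unigram_counts[w] for w in gram)
        if ratio >= threshold:
            candidates[" ".join(gram)] = ratio
    return candidates

text = ("HMG-CoA reductase is a rate-limiting enzyme. "
        "HMG-CoA reductase is the primary enzyme in cholesterol synthesis.")

bigram_tokens = merge_candidates(text, n=2)
trigram_tokens = merge_candidates(text, n=3)
print(bigram_tokens)
print(trigram_tokens)
```

On a sample this small, 'reductase is' and 'HMG-CoA reductase is' pass the test as well as 'HMG-CoA reductase', because 'is' happens to occur only after 'reductase'. On a larger corpus a common word like 'is' would presumably pull the ratio down, but the threshold itself remains arbitrary.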

I want to do this on a large amount of unstructured text data. It's fairly standard text. One problem with my approach is that I would need to set the threshold arbitrarily. Does NLTK already provide something for solving this problem, or does anyone know of a common approach?

This is too broad as a question. Taken together, "arbitrarily set the threshold" and "is there a library in NLTK" make it too open-ended for Stack Overflow. — Antti Haapala -- Слава Україні, Feb 16 '15 at 19:27
Ah [see this question](http://stackoverflow.com/questions/2452982/how-to-extract-common-significant-phrases-from-a-series-of-text-entries) — Antti Haapala -- Слава Україні, Feb 16 '15 at 19:33
No need to remove. And if you have more open-ended discussion you can come to the [Python chatroom](http://chat.stackoverflow.com/rooms/6/python) — Antti Haapala -- Слава Україні, Feb 16 '15 at 19:37
