I have a file, "filename.txt". I need to get all the n-grams, say trigrams, along with their frequency, in a sorted manner. My aim is basically to get the most commonly used phrases.
How do I do this using nltk/scikit-learn?
Here's a solution without nltk:
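from collections import deque

def window(seq, n=3):
    """Yield each run of n consecutive items from seq as a tuple."""
    it = iter(seq)
    # Pre-fill the window with the first n-1 items; maxlen=n makes each
    # later append drop the oldest item automatically.
    win = deque((next(it, None) for _ in range(n - 1)), maxlen=n)
    for e in it:
        win.append(e)
        yield tuple(win)

def sorted_grams(doc, n=3):
    """Return (count, ngram) pairs, most frequent first."""
    counts = {}
    for ngram in window(doc, n):
        counts[ngram] = counts.get(ngram, 0) + 1
    return sorted(((v, k) for k, v in counts.items()), reverse=True)

example_doc = 'it is a small world after all it is a small world after all'
for s in sorted_grams(example_doc.split(), 3):
    print(s)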
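If only the most frequent phrases are needed, the same idea can be written more compactly with `collections.Counter` from the standard library. This is a minimal sketch, not a replacement for the approach above: the helper names `ngrams` and `top_ngrams` are made up for illustration, and it assumes the tokens are already in a list (so they can be sliced) and that splitting on whitespace is good enough as tokenization.

```python
from collections import Counter

def ngrams(tokens, n=3):
    # Hypothetical helper: zip n staggered slices of the token list,
    # so each tuple holds n consecutive tokens.
    return zip(*(tokens[i:] for i in range(n)))

def top_ngrams(tokens, n=3, k=None):
    # Hypothetical helper: count every n-gram and return (ngram, count)
    # pairs, most frequent first; k=None returns all of them.
    return Counter(ngrams(tokens, n)).most_common(k)

example_doc = 'it is a small world after all it is a small world after all'
for gram, count in top_ngrams(example_doc.split(), 3, k=3):
    print(count, ' '.join(gram))
```

For the file in the question, the token list could come from something like `open("filename.txt").read().split()`, again assuming plain whitespace tokenization is acceptable.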