How do I get sorted frequency of phrases using n-gram analysis in Python?

Question

I have a file, "filename.txt". I need to get all the n-grams (say, trigrams) along with their frequencies, sorted by frequency. My aim is to find the most commonly used phrases.

How do I do this using nltk/scikit-learn?

possible duplicate of [Computing N Grams using Python](http://stackoverflow.com/questions/13423919/computing-n-grams-using-python) — Shawn Mehan, Aug 26 '15 at 22:17

score 1 · Answer 1 · answered Aug 26 '15 at 22:48

Here's a solution without nltk:

from collections import deque

def window(seq, n=3):
    # Slide a window of n items over seq, yielding each window as a tuple.
    it = iter(seq)
    # Pre-fill with the first n-1 items; maxlen=n drops the oldest item on append.
    win = deque((next(it, None) for _ in range(n-1)), maxlen=n)
    for e in it:
        win.append(e)
        yield tuple(win)

def sorted_grams(doc, n=3):
    # Count each n-gram, then return (count, ngram) pairs, most frequent first.
    counts = {}
    for ngram in window(doc, n):
        counts[ngram] = counts.get(ngram, 0) + 1

    return sorted(((v,k) for k,v in counts.items()), reverse=True)


example_doc = 'it is a small world after all it is a small world after all'
for s in sorted_grams(example_doc.split(), 3):
    print(s)
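
For what it's worth, the same counting can be sketched more compactly with `collections.Counter` from the standard library, whose `most_common()` returns entries sorted by descending count. The helper names `ngrams` and `top_ngrams` below are made up for illustration, and plain whitespace splitting stands in for real tokenization:

```python
from collections import Counter

def ngrams(tokens, n=3):
    """Return an iterator over consecutive n-token tuples (hypothetical helper)."""
    # zip over n staggered slices of the token list; zip stops at the
    # shortest slice, so incomplete trailing windows are dropped.
    return zip(*(tokens[i:] for i in range(n)))

def top_ngrams(tokens, n=3, k=None):
    """Return (ngram, count) pairs, most frequent first (hypothetical helper)."""
    # k=None returns every n-gram; otherwise only the k most common.
    return Counter(ngrams(tokens, n)).most_common(k)

example_doc = 'it is a small world after all it is a small world after all'
for gram, count in top_ngrams(example_doc.split(), n=3, k=5):
    print(count, ' '.join(gram))
```

To run this on the file from the question, something like `open('filename.txt').read().split()` should give a usable token list, assuming the file fits in memory and whitespace tokenization is good enough.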
