I have a file, "filename.txt". I need to get all the n-grams (say, trigrams) in it along with their frequencies, sorted by frequency. My aim is to find the most commonly used phrases.

How do I do this using nltk/scikit-learn?

n00b

1 Answer

Here's a solution without nltk:

from collections import deque

def window(seq, n=3):
    # Slide a window of length n over seq, yielding each n-gram as a tuple.
    it = iter(seq)
    # Pre-fill with the first n-1 items; maxlen=n drops the oldest item on append.
    win = deque((next(it, None) for _ in range(n - 1)), maxlen=n)
    for e in it:
        win.append(e)
        yield tuple(win)

def sorted_grams(doc, n=3):
    # Count each n-gram, then return (count, ngram) pairs, most frequent first.
    counts = {}
    for ngram in window(doc, n):
        counts[ngram] = counts.get(ngram, 0) + 1

    return sorted(((v, k) for k, v in counts.items()), reverse=True)


example_doc = 'it is a small world after all it is a small world after all'
for s in sorted_grams(example_doc.split(), 3):
    print(s)
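
For reading the phrases straight from the file in the question, here is a shorter sketch of the same counting idea using collections.Counter. The helper names top_ngrams and top_ngrams_in_file are made up for illustration, and lowercasing plus whitespace splitting is an assumed, very rough tokenization that you would probably want to replace with a real tokenizer.

```python
from collections import Counter

def top_ngrams(tokens, n=3, k=None):
    """Return (ngram, count) pairs, most frequent first (all pairs if k is None)."""
    # Zipping n staggered slices of the token list yields every run of
    # n consecutive tokens; zip stops at the shortest slice.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(k)

def top_ngrams_in_file(path, n=3, k=10):
    # Assumption: the file is UTF-8 text, and lowercasing + splitting on
    # whitespace is good enough as a tokenizer for a first look.
    with open(path, encoding="utf-8") as f:
        return top_ngrams(f.read().lower().split(), n, k)

example_doc = 'it is a small world after all it is a small world after all'
for ngram, count in top_ngrams(example_doc.split(), 3):
    print(count, ngram)
```

Calling top_ngrams_in_file("filename.txt") should then give the ten most frequent trigrams. If you do want the libraries from the question, nltk.ngrams(tokens, 3) produces the same kind of tuples to feed into Counter, and scikit-learn's CountVectorizer takes an ngram_range=(3, 3) argument, though I haven't tested either against your file.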
fivetentaylor