
I'm trying to find the k most common n-grams in a large corpus. I've seen lots of places suggest the naïve approach: simply scan through the entire corpus and keep a dictionary of counts for every n-gram. Is there a better way to do this?
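For concreteness, here's a sketch of the naïve approach I mean (the corpus path, n, and k are placeholders):

from collections import Counter

def naive_top_k(path, n=2, k=10):
    # One pass over the corpus, keeping a dictionary of counts for every n-gram.
    tokens = open(path).read().split()
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    # most_common(k) uses a heap, so selection is O(distinct n-grams * log k).
    return counts.most_common(k)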

  • Possible duplicate of http://cs.stackexchange.com/questions/8972/optimal-algorithm-for-finding-all-ngrams-from-a-pre-defined-set-in-a-text – pltrdy Feb 21 '17 at 17:23
  • What are you comparing against? How big is the corpus? You can count n-grams rather quickly in C++ even for a huge corpus, and even in Python it's pretty fast =) – alvas Feb 22 '17 at 08:23
  • Do you mean character n-grams or word n-grams? – alvas Feb 22 '17 at 08:24
  • I'm using word n-grams, but I imagine character n-grams would generalize. As for the corpus, this should be able to scale to a corpus of 20 GB or so, run on a Hadoop cluster – bendl Feb 23 '17 at 00:01

1 Answer


In Python, using NLTK:

$ wget http://norvig.com/big.txt
$ python
>>> from collections import Counter
>>> from nltk import ngrams
>>> bigtxt = open('big.txt').read()
>>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]

In pure Python, without NLTK (see Fast/Optimize N-gram implementations in python):

>>> import collections
>>> def ngrams(text, n=2):
...     # Zip n staggered copies of the token list; each tuple is one n-gram.
...     return zip(*[text[i:] for i in range(n)])
>>> ngram_counts = collections.Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]
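The zip trick works by pairing each token with its successors. A toy illustration (not from the original answer):

>>> tokens = ['a', 'b', 'c', 'd']
>>> [tokens[i:] for i in range(2)]
[['a', 'b', 'c', 'd'], ['b', 'c', 'd']]
>>> list(zip(*[tokens[i:] for i in range(2)]))
[('a', 'b'), ('b', 'c'), ('c', 'd')]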

In Julia (see Generate ngrams with Julia):

# Requires the StatsBase and IterTools packages (Julia >= 1.0).
using StatsBase: countmap
using IterTools: partition
bigtxt = read("big.txt", String)
ngram_counts = countmap(collect(partition(split(bigtxt), 2, 1)))

Rough timings:

$ time python ngram-test.py # With NLTK.

real    0m3.166s
user    0m2.274s
sys 0m0.528s

$ time python ngram-native-test.py 

real    0m1.521s
user    0m1.317s
sys 0m0.145s

$ time julia ngram-test.jl 

real    0m3.573s
user    0m3.188s
sys 0m0.306s
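All of the snippets above read the whole corpus into memory. For a corpus in the 20 GB range mentioned in the comments, here's a hedged sketch of a streaming variant that keeps only the counts in memory (the file name and n=2 are assumptions carried over from the examples above):

from collections import Counter

def count_ngrams_streaming(path, n=2):
    counts = Counter()
    carry = []  # last n-1 tokens of the previous line, so n-grams can span lines
    with open(path) as f:
        for line in f:
            tokens = carry + line.split()
            # Count every n-gram in this chunk; an n-gram wholly inside `carry`
            # would need n tokens but carry holds at most n-1, so nothing is
            # counted twice across iterations.
            counts.update(zip(*[tokens[i:] for i in range(n)]))
            carry = tokens[-(n - 1):] if n > 1 else []
    return counts

print(count_ngrams_streaming('big.txt').most_common(10))

The Counter itself can still grow large (one entry per distinct n-gram); when that becomes the bottleneck, the usual next step is to shard the counts by a hash of the n-gram across machines, which is essentially what a Hadoop word-count-style job does.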