18

I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?

Trindaz
  • More reading for those interested: http://packages.python.org/Whoosh/ngrams.html – Trindaz Sep 29 '11 at 05:06
  • Related question: http://stackoverflow.com/questions/21883108/fast-optimize-n-gram-implementations-in-python – dmcc Apr 21 '14 at 18:14

4 Answers

26

Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality.

I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.

def ngrams(tokens, MIN_N, MAX_N):
    # yield every n-gram of length MIN_N through MAX_N as a slice of the token list
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            yield tokens[i:j]

Then replace the yield with the actual action you want to take on each n-gram (add it to a dict, store it in a database, whatever) to get rid of the generator overhead.
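
For example, here is a minimal plain-Python sketch of that intermediate step (my own illustration, counting each n-gram into a defaultdict and joining tokens with spaces; adapt the "action" to whatever you actually need):

from collections import defaultdict

def count_ngrams(tokens, MIN_N, MAX_N):
    # same loops as above, but counting happens inline instead of yielding
    counts = defaultdict(int)
    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            counts[" ".join(tokens[i:j])] += 1
    return counts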

Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict instead of yield:

from collections import defaultdict

def ngrams(tokens, int MIN_N, int MAX_N):
    cdef Py_ssize_t i, j, n_tokens

    count = defaultdict(int)

    # hoist the method lookup out of the inner loop
    join_spaces = " ".join

    n_tokens = len(tokens)
    for i in range(n_tokens):
        for j in range(i + MIN_N, min(n_tokens, i + MAX_N) + 1):
            count[join_spaces(tokens[i:j])] += 1

    return count
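
To compile it, one common route (a sketch; saving the Cython code above as ngrams.pyx is just an assumption) is a small setup.py using Cython.Build.cythonize:

# setup.py -- minimal build script; assumes the code above is saved as ngrams.pyx
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("ngrams.pyx"))

Build in place with python setup.py build_ext --inplace, then import ngrams from regular Python code.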
Fred Foo
  • Newer versions of Cython recognize Python for statements and speed them up where possible. Also, you have a method lookup in the inner iteration: defining tokenjoiner = " ".join outside the loop and replacing the inner " ".join should speed things up. – rocksportrocker Sep 29 '11 at 11:04
  • And you can rewrite the inner line with "count.get(....) +=1", introducing another var to avoid the method lookup. – rocksportrocker Sep 29 '11 at 11:11
9

You might find a Pythonic, elegant and fast n-gram generation function using zip and the splat (*) operator here:

def find_ngrams(input_list, n):
    # zip together n staggered copies of the list; each resulting tuple is one n-gram
    return zip(*[input_list[i:] for i in range(n)])
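
For instance, on a toy input of my own (under Python 3, zip returns an iterator, so wrap the result in list() to materialize it):

list(find_ngrams(['I', 'really', 'like', 'python'], 2))
# [('I', 'really'), ('really', 'like'), ('like', 'python')]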
Wxds
0

For character-level n-grams, you could use the following function:

def ngrams(text, n):
    # for each position i, take the n characters ending at i (inclusive);
    # the first n-1 entries would need characters before the start of the
    # string, so they are dropped by the final slice
    n -= 1
    return [text[i - n:i + 1] for i, char in enumerate(text)][n:]
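
For example, on a toy string of my own:

ngrams("hello", 3)
# ['hel', 'ell', 'llo']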
asmaier
0
def generate_ngrams(words, ngram=2):
    # slide a window of `ngram` words over the list
    return [words[i:i + ngram] for i in range(len(words) - ngram + 1)]



sentence = "I really like python, it's pretty awesome."
words = sentence.split()
words

['I', 'really', 'like', 'python,', "it's", 'pretty', 'awesome.']


res = generate_ngrams(words, ngram=2)
res

[['I', 'really'],
 ['really', 'like'],
 ['like', 'python,'],
 ['python,', "it's"],
 ["it's", 'pretty'],
 ['pretty', 'awesome.']]


res = generate_ngrams(words, ngram=3)
res

[['I', 'really', 'like'],
 ['really', 'like', 'python,'],
 ['like', 'python,', "it's"],
 ['python,', "it's", 'pretty'],
 ["it's", 'pretty', 'awesome.']]


res = generate_ngrams(words, ngram=4)
res

[['I', 'really', 'like', 'python,'],
 ['really', 'like', 'python,', "it's"],
 ['like', 'python,', "it's", 'pretty'],
 ['python,', "it's", 'pretty', 'awesome.']]
Biranchi