Implementing ngrams in Python

Question

I'm writing a basic n-gram implementation in Python as a personal challenge. I started with unigrams and worked up to trigrams:

def unigrams(text):
    uni = []
    for token in text:
        uni.append([token])
    return uni

def bigrams(text):
    bi = []
    token_address = 0
    for token in text[:len(text) - 1]:
        bi.append([token, text[token_address + 1]])
        token_address += 1
    return bi

def trigrams(text):
    tri = []
    token_address = 0
    for token in text[:len(text) - 2]:
        tri.append([token, text[token_address + 1], text[token_address + 2]])
        token_address += 1
    return tri

Now the fun part: generalizing to n-grams. The main problem with generalizing the approach I have here is building the list of length n that gets passed to append. I initially thought lambdas might be a way to do it, but I can't figure out how.

Also, other implementations I'm looking at are taking an entirely different tack (no surprise), e.g. here and here, so I'm starting to wonder if I'm at a dead end.

Before I give up on this approach, I'm curious: 1) Is there a one-line or Pythonic way to build a list of arbitrary length n in this manner? 2) What are the downsides of approaching the problem this way?

see http://stackoverflow.com/questions/18658106/quick-implementation-of-character-n-grams-using-python/22427077#22427077 — alvas, Mar 15 '14 at 21:47

score 2 · Accepted Answer · edited Nov 22 '18 at 10:51

The following function should work for a general n-gram model.

def ngram(text, grams):
    # model will hold every run of `grams` consecutive tokens from text
    model = []
    # range() is empty when grams > len(text), so no short n-grams are produced
    for count in range(len(text) - grams + 1):
        model.append(text[count:count + grams])
    return model
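
As a quick sanity check, here is a minimal sketch of how the slicing approach behaves for different values of n. The helper name `ngram_slices` and the sample sentence are assumptions made for illustration; the input is assumed to be a list of tokens or a string.

```python
def ngram_slices(tokens, n):
    # Hypothetical helper: collect one slice of length n per start index.
    # range() is empty when n > len(tokens), so nothing shorter than n is returned.
    model = []
    for start in range(len(tokens) - n + 1):
        model.append(tokens[start:start + n])
    return model

tokens = "the quick brown fox".split()
print(ngram_slices(tokens, 1))  # [['the'], ['quick'], ['brown'], ['fox']]
print(ngram_slices(tokens, 2))  # [['the', 'quick'], ['quick', 'brown'], ['brown', 'fox']]
print(ngram_slices(tokens, 5))  # []
print(ngram_slices("abcd", 2))  # ['ab', 'bc', 'cd'] (slicing a string gives character n-grams)
```

With n set to 1, 2 and 3 this should give the same lists as the unigrams, bigrams and trigrams functions in the question.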

edited Nov 22 '18 at 10:51 by Faisal Maqbool

answered Jan 31 '13 at 03:05 by jitendra

score 1 · Answer 2 · answered Jan 27 '14 at 17:47

As a convenient one-liner:

def retrieve_ngrams(txt, n):
    return [txt[i:i+n] for i in range(len(txt)-(n-1))]
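
Another common one-liner, not taken from the answers here, zips together n copies of the sequence, each shifted by one more position. The sketch below assumes the input supports slicing; `zip_ngrams` is a hypothetical name. It returns tuples rather than lists, which can be handy because tuples are hashable and can be counted directly.

```python
from collections import Counter

def zip_ngrams(tokens, n):
    # Hypothetical helper: tokens[0:], tokens[1:], ..., tokens[n-1:] are zipped
    # position by position; zip stops at the shortest, so every tuple has length n.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "to be or not to be".split()
print(zip_ngrams(tokens, 3))
# [('to', 'be', 'or'), ('be', 'or', 'not'), ('or', 'not', 'to'), ('not', 'to', 'be')]

# Tuples are hashable, so n-gram frequencies can be counted without conversion.
bigram_counts = Counter(zip_ngrams(tokens, 2))
print(bigram_counts[("to", "be")])  # 2
```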

answered Jan 27 '14 at 17:47 by vermillon

score 0 · Answer 3 · answered Mar 16 '20 at 10:38

Try this.

def get_ngrams(wordlist, n):
    ngrams = []
    for i in range(len(wordlist) - (n - 1)):
        ngrams.append(wordlist[i:i + n])
    return ngrams
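
On the question of downsides: slicing needs an indexable sequence that is already in memory, so it will not work on a generator or any other one-pass iterable. A possible workaround is a sliding window over the iterable. The sketch below is one way to do that, not the method used in the answers above, and `iter_ngrams` is a hypothetical name.

```python
from collections import deque

def iter_ngrams(tokens, n):
    # Hypothetical helper: works on any iterable, including generators.
    if n < 1:
        raise ValueError("n must be at least 1")
    # A deque with maxlen=n drops its oldest item automatically on append.
    window = deque(maxlen=n)
    for token in tokens:
        window.append(token)
        if len(window) == n:
            # Copy the window into a tuple so later appends do not change it.
            yield tuple(window)

word_stream = (w for w in "a b c d".split())  # a generator, so it cannot be sliced
print(list(iter_ngrams(word_stream, 3)))  # [('a', 'b', 'c'), ('b', 'c', 'd')]
```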

answered Mar 16 '20 at 10:38 by adikh
