0

I'm writing a basic n-gram implementation in Python as a personal challenge. I started with unigrams and worked up to trigrams:

def unigrams(text):
    uni = []
    for token in text:
        uni.append([token])
    return uni

def bigrams(text):
    bi = []
    token_address = 0
    for token in text[:len(text) - 1]:
        bi.append([token, text[token_address + 1]])
        token_address += 1
    return bi

def trigrams(text):
    tri = []
    token_address = 0
    for token in text[:len(text) - 2]:
        tri.append([token, text[token_address + 1], text[token_address + 2]])
        token_address += 1
    return tri

Now the fun part: generalizing to n-grams. The main problem with generalizing this approach is building the list of length n that gets passed to append. I initially thought lambdas might be a way to do it, but I can't figure out how.

Also, the other implementations I'm looking at take an entirely different tack (no surprise), e.g. here and here, so I'm starting to wonder if I'm at a dead end.

Before I give up on this approach, I'm curious: 1) Is there a one-line or Pythonic way to build a list of arbitrary size in this manner? 2) What are the downsides of approaching the problem this way?

acpigeon
  • see http://stackoverflow.com/questions/18658106/quick-implementation-of-character-n-grams-using-python/22427077#22427077 – alvas Mar 15 '14 at 21:47

3 Answers

2

The following function should work for a general n-gram model.

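def ngram(text, grams):
    # model collects each run of `grams` consecutive tokens as a slice of text
    model = []
    # Iterating over start indices (rather than over text[:len(text)-grams+1])
    # avoids a negative slice bound, which would yield short n-grams
    # whenever grams > len(text) + 1.
    for count in range(len(text) - grams + 1):
        model.append(text[count:count + grams])
    return model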
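
A quick check of what this returns, as a minimal sketch: the function is restated with a loop over start indices so the block runs on its own, and the sample sentence is made up for illustration.

```python
def ngram(text, grams):
    # Restated from the answer above (looping over start indices)
    # so that this block is self-contained.
    model = []
    for count in range(len(text) - grams + 1):
        model.append(text[count:count + grams])
    return model

tokens = "the quick brown fox".split()  # hypothetical sample input

print(ngram(tokens, 2))
# [['the', 'quick'], ['quick', 'brown'], ['brown', 'fox']]
print(ngram(tokens, 4))
# [['the', 'quick', 'brown', 'fox']]
print(ngram(tokens, 5))
# []
```

Each n-gram is a slice of the input, so a list of tokens gives a list of lists, and asking for more tokens than the text contains gives an empty list.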
Faisal Maqbool
jitendra
1

As a convenient one-liner:

def retrieve_ngrams(txt, n):
    return [txt[i:i+n] for i in range(len(txt)-(n-1))]
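
Because the one-liner relies only on len() and slicing, it should work on strings as well as lists. The sketch below checks that, and also shows a zip-based variant for comparison; zip_ngrams is a hypothetical name, not something from the question or this answer, and it returns tuples rather than slices.

```python
def retrieve_ngrams(txt, n):
    # Restated from the answer above so this block is self-contained.
    return [txt[i:i+n] for i in range(len(txt)-(n-1))]

def zip_ngrams(seq, n):
    # Hypothetical alternative: zip n shifted copies of the sequence.
    # zip stops at the shortest copy, so no explicit length arithmetic is needed.
    return list(zip(*(seq[i:] for i in range(n))))

print(retrieve_ngrams("abcd", 2))           # ['ab', 'bc', 'cd']
print(retrieve_ngrams(["a", "b", "c"], 3))  # [['a', 'b', 'c']]
print(retrieve_ngrams(["a", "b"], 3))       # []
print(zip_ngrams(["a", "b", "c", "d"], 2))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```

The slicing version keeps the input's type for each n-gram (substrings for a string, sublists for a list), which may or may not be what you want; tuples from the zip variant are hashable and so can be counted directly with a dict.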
vermillon
0

Try this.

def get_ngrams(wordlist, n):
    ngrams = []
    for i in range(len(wordlist) - (n - 1)):
        ngrams.append(wordlist[i:i + n])
    return ngrams
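
On the question of downsides: one limitation of the slicing approach is that it needs len() and indexing, so it will not accept a generator or a file being read token by token, and it builds the whole result list in memory. A rough sketch of a streaming alternative, assuming tuples are acceptable as n-grams (iter_ngrams is a hypothetical name):

```python
from collections import deque

def iter_ngrams(tokens, n):
    """Yield n-grams as tuples from any iterable (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A deque with maxlen=n drops the oldest token automatically
    # once a new one is appended to a full window.
    window = deque(maxlen=n)
    for token in tokens:
        window.append(token)
        if len(window) == n:
            yield tuple(window)

# A generator has no len() and cannot be sliced, but it can be iterated.
words = (w for w in "to be or not to be".split())
print(list(iter_ngrams(words, 3)))
# [('to', 'be', 'or'), ('be', 'or', 'not'), ('or', 'not', 'to'), ('not', 'to', 'be')]
```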
adikh