I'm writing a basic n-gram implementation in Python as a personal challenge. I started with unigrams and worked up to trigrams:
def unigrams(text):
    uni = []
    for token in text:
        uni.append([token])
    return uni

def bigrams(text):
    bi = []
    token_address = 0
    for token in text[:len(text) - 1]:
        bi.append([token, text[token_address + 1]])
        token_address += 1
    return bi

def trigrams(text):
    tri = []
    token_address = 0
    for token in text[:len(text) - 2]:
        tri.append([token, text[token_address + 1], text[token_address + 2]])
        token_address += 1
    return tri
Now the fun part: generalizing to n-grams. The main obstacle to generalizing this approach is building the list of length n that gets passed to append. I initially thought lambdas might be a way to do it, but I can't figure out how.
Also, other implementations I'm looking at are taking an entirely different tack (no surprise), e.g. here and here, so I'm starting to wonder if I'm at a dead end.
Before I give up on this approach, I'm curious: 1) Is there a one-line or Pythonic way of building a list of arbitrary length in this manner? 2) What are the downsides of approaching the problem this way?
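To make question 1 concrete, here is a minimal sketch of the kind of generalization I have in mind, assuming list slicing can stand in for the hand-built list. The function name `ngrams` and its `n` parameter are hypothetical, not taken from any library, and `text` is assumed to be a list of tokens as in the functions above.

```python
def ngrams(text, n):
    """Return every run of n consecutive tokens in text, each as a list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # text[i:i + n] builds the length-n list in one step.
    # The range ends at the last index where a full n-gram still fits,
    # so an input shorter than n yields an empty result.
    return [list(text[i:i + n]) for i in range(len(text) - n + 1)]


tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
# [['the', 'quick'], ['quick', 'brown'], ['brown', 'fox']]
```

With n set to 1, 2, or 3 this should give the same output as the three functions above. One cost that seems relevant to question 2: every n-gram is a fresh copy of n tokens, so the result holds roughly n times as many references as the input.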