How to split a word into n-grams in Python?

Question

I've got this question. I should split a word into n-grams (for example, the word ADVENTURE has three 4-grams: ADVE, ENTU, TURE). The real input is a book file (that's the reason for the counter and isalpha), which I don't have here, so I'm using only a list of 2 words. This is my code in Python:

words = ['adven', 'adventure']
def ngrams(words, n):
    counter = {} 
    for word in words:
        if (len(word)-1) >= n:
            for i in range(0, len(word)):
                if word.isalpha() == True:
                    ngram = ""
                    for i in range(len(word)):
                            ngram += word[i:n:]
                            if len(ngram) == n:
                                ngram.join(counter)
                                counter[ngram] = counter.get(ngram, 0) + 1
    return counter

print(ngrams(words, 4))

This is what the code gives me:
{'adve': 14}

I don't care about the values in it, but I'm not so good with strings and I don't know what I should do to make it give me the three 4-grams. I tried "ngram += word[i::]" but that gives me None. Please help me; this is my school homework and I can't write the other functions while this ngrams function doesn't work.

Doesn't adventure have 6 4grams ADVE DVEN VENT ENTU NTUR TURE? — Stuart, Dec 18 '22 at 18:06
Does this answer your question? [Quick implementation of character n-grams for word](https://stackoverflow.com/questions/18658106/quick-implementation-of-character-n-grams-for-word) — Stuart, Dec 18 '22 at 18:08
No, each n-gram should start with the last letter of the previous n-gram, and when fewer than n letters are left for the last n-gram, it takes the missing letters from the previous n-gram. So the 4-grams of the word 'the' are None. In the word 'adventure' they are: 'adve' (the 'e' begins the next 4-gram), 'entu' (the letters left are 'ure', which is not a 4-gram, so it also takes the letter 't'), 'ture'. Another word, 'advent': 'adve' (leaving 'ent'), 'vent' (takes the 'v' to make a 4-gram). — oliv, Dec 18 '22 at 18:12

score 0 · Answer 1 · edited Dec 18 '22 at 18:38

Use nltk.ngrams for this job:

from nltk import ngrams

edited Dec 18 '22 at 18:38 by Donald Duck

answered Dec 18 '22 at 18:17 by The Lord

Oh, I didn't mention that I can't use any imports. I would do that immediately if I could. – oliv Dec 18 '22 at 18:19
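
For comparison, the conventional sliding-window character n-grams that Stuart's comment lists can be sketched without any imports. This only illustrates the usual definition, not the overlap-by-one rule described in the question's comments, and `sliding_ngrams` is a hypothetical name. Note that a slice's stop is an absolute index, so each window is `word[i:i + n]` rather than `word[i:n]`.

```python
def sliding_ngrams(word, n):
    """Return every run of n consecutive characters in word (hypothetical helper)."""
    # There are len(word) - n + 1 windows; the range is empty when the word is too short.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


print(sliding_ngrams("adventure", 4))
# ['adve', 'dven', 'vent', 'entu', 'ntur', 'ture']
```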

fre · Answer 2 · 2022-12-18T18:58:27.060

I think your definition of n-grams is a little different from the conventional one, as pointed out by @Stuart in his comment. However, with the definition from your comment, I think the following would solve your problem.

def n_grams(word, n):

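    # The one-character overlap only makes sense for n >= 2;
    # with a smaller n the loop below would never advance.
    if n < 2:
        raise ValueError("n must be at least 2")

    # We can't find n-grams if the word has fewer than n letters.
    if n > len(word):
        return []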

    output = []
    start_idx = 0
    end_idx = start_idx + n

    # Grab all n-grams except the last one
    while end_idx < len(word):
        n_gram = word[start_idx:end_idx]
        output.append(n_gram)
        start_idx = end_idx - 1
        end_idx = start_idx + n

    # Grab the last n-gram
    last_n_gram_start = len(word) - n
    last_n_gram_end = len(word)
    output.append(word[last_n_gram_start:last_n_gram_end])

    return output
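
Since the question ultimately needs a dictionary of counts over many words, one way to combine the splitting logic above with the counter and isalpha check from the question might look like the sketch below. The names `split_word` and `count_ngrams` are hypothetical. `split_word` restates the same overlap-by-one logic compactly so the block runs on its own, and it assumes n is at least 2.

```python
def split_word(word, n):
    """Overlap-by-one n-grams of a single word (hypothetical helper, assumes n >= 2)."""
    if n > len(word):
        return []
    # Regular n-grams start every n - 1 characters...
    grams = [word[i:i + n] for i in range(0, len(word) - n, n - 1)]
    # ...and the last n-gram is always the final n characters.
    grams.append(word[len(word) - n:])
    return grams


def count_ngrams(words, n):
    """Count n-grams over all purely alphabetic words, without imports."""
    counter = {}
    for word in words:
        if not word.isalpha():
            continue  # skip tokens containing digits, punctuation, etc.
        for gram in split_word(word, n):
            counter[gram] = counter.get(gram, 0) + 1
    return counter


print(count_ngrams(["adven", "adventure"], 4))
# {'adve': 2, 'dven': 1, 'entu': 1, 'ture': 1}
```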

score 0 · Answer 3 · answered Dec 18 '22 at 20:52

If I've understood the rules correctly, you can do it like this:

def special_ngrams(word, n):
    """ Yield character ngrams of word that overlap by only one character,
        except for the last two ngrams which may overlap by more than one
        character. The first and last ngrams of the word are always included.
        Words shorter than n yield nothing. """
    if len(word) < n:
        return
    for start in range(0, len(word) - n, n - 1):
        yield word[start:start + n]
    yield word[-n:]

for word in "hello there this is a test", "adventure", "tyrannosaurus", "advent":
    print(list(special_ngrams(word, 4)))
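
To see why the `range` call does the job, it may help to look at the start offsets it produces: regular windows begin every n - 1 characters, and the final window is pinned to the end of the word. The sketch below uses a hypothetical helper, `window_starts`, and assumes the word has at least n characters and n is at least 2.

```python
def window_starts(word, n):
    """Start offsets of each n-gram under the overlap-by-one rule (hypothetical helper)."""
    # Regular windows begin every n - 1 characters, stopping before the last window.
    starts = list(range(0, len(word) - n, n - 1))
    # The last window always covers the final n characters.
    starts.append(len(word) - n)
    return starts


for word in ("adventure", "advent"):
    starts = window_starts(word, 4)
    print(word, starts, [word[s:s + 4] for s in starts])
# adventure [0, 3, 5] ['adve', 'entu', 'ture']
# advent [0, 2] ['adve', 'vent']
```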
