Creating single and double character n-grams from text files

Question

I want my code to split a text file into single- and double-character n-grams. For example, if the word 'dogs' came up, I would want 'do', 'og', and 'gs'. The problem is that I can only seem to split the text into whole words.

I tried to use just a simple split() but that didn't seem to work for overlapping n-grams.

from collections import Counter 
from nltk.util import ngrams

def ngram_dist(fname, n):
    with open(fname, 'r') as fp:
        for lines in fp:
            for words in lines:
                    result = Counter(ngrams(fname.split(),n))
    return result

Do you want to split each word into n-grams? Or each sentence? Or each file? Typically n-grams don't split apart a word; they split apart a sentence. — Error - Syntactical Remorse, Apr 17 '19 at 22:57
This should work: `b="dogs"; print([b[i:i+2] for i in range(len(b)-1)])` — l'L'l, Apr 17 '19 at 23:01
@lll No this is definitely not a dupe of that, that only asks for character n-grams of **words**, this one wants to process **all lines in a file**. Please don't just rely on a question's title, often they need correcting. — smci, Apr 18 '19 at 00:45
**Please don't post [duplicate questions](https://stackoverflow.com/questions/55737605/creating-n-gram-dictionary-with-frequencies), it's not allowed.** Can you merge one of these askings into the other (paste your code and details), then close the unwanted one? — smci, Apr 18 '19 at 00:49

score 2 · Answer 1 · answered Apr 17 '19 at 23:03

Here is a solution. It treats spaces as characters, so some n-grams span word boundaries; strip the spaces first if you don't want that.

import re

def generate_ngrams(s, n):
    # Convert to lowercase
    s = s.lower()

    # Replace all non-alphanumeric characters with spaces
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)

    # Split the string into individual characters (spaces included)
    tokens = list(s)

    # zip n shifted copies of the character list to form n-grams,
    # then join each tuple of characters into a string
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return ["".join(ngram) for ngram in ngrams]

print(generate_ngrams("My Dogs is sick", 2))
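
If the n-grams should stay inside individual words and be counted across a whole file, as the question asks, one option is to extract the words first and slice each one. This is a minimal sketch, assuming words are runs of ASCII letters and digits and the file is UTF-8. The helper names `word_char_ngrams` and `count_char_ngrams` are hypothetical, and `ngram_dist` simply reuses the question's function name.

```python
import re
from collections import Counter

def word_char_ngrams(word, n):
    """Return the overlapping character n-grams of a single word."""
    # Words shorter than n yield an empty list
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def count_char_ngrams(lines, n):
    """Count character n-grams inside each word of an iterable of lines."""
    counts = Counter()
    for line in lines:
        # Assumption: a "word" is a run of lowercase ASCII letters or digits
        for word in re.findall(r"[a-z0-9]+", line.lower()):
            counts.update(word_char_ngrams(word, n))
    return counts

def ngram_dist(fname, n):
    """File wrapper; the UTF-8 encoding is an assumption."""
    with open(fname, "r", encoding="utf-8") as fp:
        return count_char_ngrams(fp, n)

print(word_char_ngrams("dogs", 2))  # ['do', 'og', 'gs']
```

Calling `ngram_dist(path, 1)` and `ngram_dist(path, 2)` would give the single- and double-character counts. Because each word is sliced separately, no n-gram contains a space.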
