I want my code to split a text file into one- and two-character n-grams. For example, if the word 'dogs' came up, I would want 'do', 'og', and 'gs'. The problem is that I can only seem to split the text into whole words.

I tried to use just a simple split() but that didn't seem to work for overlapping n-grams.

from collections import Counter 
from nltk.util import ngrams

def ngram_dist(fname, n):
    with open(fname, 'r') as fp:
        for lines in fp:
            for words in lines:
                    result = Counter(ngrams(fname.split(),n))
    return result
  • Do you want to split each word into n-grams? Or each sentence? Or each file? Typically n-grams don't split apart a word; they split apart a sentence. – Error - Syntactical Remorse Apr 17 '19 at 22:57
  • This should work: `b="dogs"; print([b[i:i+2] for i in range(len(b)-1)])` – l'L'l Apr 17 '19 at 23:01
  • @lll No this is definitely not a dupe of that, that only asks for character n-grams of **words**, this one wants to process **all lines in a file**. Please don't just rely on a question's title, often they need correcting. – smci Apr 18 '19 at 00:45
  • **Please don't post [duplicate questions](https://stackoverflow.com/questions/55737605/creating-n-gram-dictionary-with-frequencies), it's not allowed.** Can you merge one of these askings into the other (paste your code and details), then close the unwanted one? – smci Apr 18 '19 at 00:49

1 Answer

Here is a solution. It treats spaces as characters, so some n-grams span word boundaries; you can filter those out if needed.

import re

def generate_ngrams(s, n):
    # Convert to lowercase
    s = s.lower()

    # Replace all non-alphanumeric characters with spaces
    s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)

    # Split the string into individual characters (spaces included)
    tokens = list(s)

    # Zip n staggered copies of the character list to form overlapping
    # n-grams, then join each tuple of characters into a string
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return ["".join(ngram) for ngram in ngrams]


print(generate_ngrams("My Dogs is sick", 2))
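If the n-grams should stay inside each word, as in the 'dogs' example from the question, one possible variant is sketched below. It is only a sketch under a few assumptions: `char_ngrams` is a hypothetical helper name, `ngram_dist` reuses the name from the question, a "word" is taken to be a run of ASCII letters or digits, and the file is assumed to be UTF-8 encoded.

```python
from collections import Counter
import re


def char_ngrams(word, n):
    """Return the overlapping character n-grams of a single word."""
    # Words shorter than n yield an empty list
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def ngram_dist(fname, n):
    """Count per-word character n-grams across every line of a text file."""
    counts = Counter()
    # Assumption: the file is UTF-8 text
    with open(fname, "r", encoding="utf-8") as fp:
        for line in fp:
            # Lowercase the line, then treat runs of letters/digits as words
            for word in re.findall(r"[a-z0-9]+", line.lower()):
                counts.update(char_ngrams(word, n))
    return counts


print(char_ngrams("dogs", 2))  # ['do', 'og', 'gs']
```

Calling `ngram_dist(fname, 1)` and `ngram_dist(fname, 2)` would then give the single-character and two-character counts separately. Because each word is sliced on its own, no n-gram contains a space.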