
I need to write a program in NLTK that breaks a corpus (a large collection of txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams. I have already written code to input my files into the program.

The input is 300 .txt files written in English, and I want the output in the form of n-grams and, especially, their frequency counts.

I know that NLTK has bigram and trigram modules: http://www.nltk.org/_modules/nltk/model/ngram.html

but I am not advanced enough to incorporate them into my program.

input: txt files NOT single sentences

output example:

Bigram: [('Hi', 'How'), ('How', 'are'), ('are', 'you'), ('you', '?'), ('?', 'i'), ('i', 'am'), ('am', 'fine'), ('fine', 'and'), ('and', 'you')]

Trigram: [('Hi', 'How', 'are'), ('How', 'are', 'you'), ('are', 'you', '?'), ('you', '?', 'i'), ('?', 'i', 'am'), ('i', 'am', 'fine'), ('am', 'fine', 'and'), ('fine', 'and', 'you')]

My code up to now is:

from nltk.corpus import PlaintextCorpusReader

corpus = 'C:/Users/jack3/My folder'
files = PlaintextCorpusReader(corpus, '.*')
ngrams = 2

def generate(file, ngrams):
    for gram in range(0, ngrams):
        print((file[0:-4] + "_" + str(ngrams) + "_grams.txt").replace("/", "_"))


for file in files.fileids():
    generate(file, ngrams)

Any help on what should be done next?

Arash

6 Answers


Just use nltk.ngrams.

import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter

text = "I need to write a program in NLTK that breaks a corpus (a large collection of \
txt files) into unigrams, bigrams, trigrams, fourgrams and fivegrams.\ 
I need to write a program in NLTK that breaks a corpus"
token = nltk.word_tokenize(text)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)
fourgrams = ngrams(token,4)
fivegrams = ngrams(token,5)

print(Counter(bigrams))

Counter({('program', 'in'): 2, ('NLTK', 'that'): 2, ('that', 'breaks'): 2,
 ('write', 'a'): 2, ('breaks', 'a'): 2, ('to', 'write'): 2, ('I', 'need'): 2,
 ('a', 'corpus'): 2, ('need', 'to'): 2, ('a', 'program'): 2, ('in', 'NLTK'): 2,
 ('and', 'fivegrams'): 1, ('corpus', '('): 1, ('txt', 'files'): 1, ('unigrams', 
','): 1, (',', 'trigrams'): 1, ('into', 'unigrams'): 1, ('trigrams', ','): 1,
 (',', 'bigrams'): 1, ('large', 'collection'): 1, ('bigrams', ','): 1, ('of',
 'txt'): 1, (')', 'into'): 1, ('fourgrams', 'and'): 1, ('fivegrams', '.'): 1,
 ('(', 'a'): 1, (',', 'fourgrams'): 1, ('a', 'large'): 1, ('.', 'I'): 1, 
('collection', 'of'): 1, ('files', ')'): 1})

UPDATE (reading the .txt files from a folder with plain Python):

import os

corpus = []
path = '.'
for i in next(os.walk(path))[2]:    # Python 3: next() instead of .next()
    if i.endswith('.txt'):
        with open(os.path.join(path, i)) as f:
            corpus.append(f.read())

frequencies = Counter()
for text in corpus:
    token = nltk.word_tokenize(text)
    bigrams = ngrams(token, 2)
    frequencies += Counter(bigrams)
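
If you also need unigrams through fivegrams over the whole folder, a minimal extension of the sketch above could look like this (the folder path is taken from the question as a placeholder, and `nltk.word_tokenize` assumes the punkt tokenizer data is installed):

import os
from collections import Counter

import nltk
from nltk.util import ngrams

path = 'C:/Users/jack3/My folder'        # placeholder: folder with the .txt files
orders = range(1, 6)                     # unigrams up to fivegrams
frequencies = {n: Counter() for n in orders}

for name in next(os.walk(path))[2]:
    if name.endswith('.txt'):
        with open(os.path.join(path, name)) as f:
            tokens = nltk.word_tokenize(f.read())
        for n in orders:
            frequencies[n].update(ngrams(tokens, n))

print(frequencies[2].most_common(10))    # the ten most frequent bigrams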
hellpanderr
  • Hi hellpanderr, thanks. I need the part that lets me feed my whole folder of txt files into the program, so that it runs through my txt files and gives me the output from them. So instead of reading text = "xxxxxx" it needs to refer to my folder of txt files. – Arash Sep 07 '15 at 15:46
  • it gives this message when I look for bigrams! – Arash Sep 07 '15 at 17:19
  • That's because the `ngrams` function returns a generator; you need to call `list` on it to actually extract the content. – hellpanderr Sep 07 '15 at 17:30
  • What kind of list? Can you tell me where in the above code it should go? – Arash Sep 07 '15 at 18:14
  • If you want to see the content of what the `ngrams` function returns, you need to send it to the `list` function, e.g. `list(ngrams(token, 2))`. But the above code should work fine, just plug in the path to your files. – hellpanderr Sep 07 '15 at 18:30

If efficiency is an issue and you have to build multiple different n-grams, but you want to use pure Python, I would do:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a list_tokens"""
    shift_token = lambda i: (el for j,el in enumerate(seq) if j>=i)
    shifted_tokens = (shift_token(i) for i in range(n))
    tuple_ngrams = zip(*shifted_tokens)
    return tuple_ngrams # if join in generator : (" ".join(i) for i in tuple_ngrams)

def range_ngrams(list_tokens, ngram_range=(1,2)):
    """Returns an itirator over all n-grams for n in range(ngram_range) given a list_tokens."""
    return chain(*(n_grams(list_tokens, i) for i in range(*ngram_range)))

Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngram_range=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]

~Same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list,n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list,n=1)
nltk.ngrams(input_list,n=2)
nltk.ngrams(input_list,n=3)
nltk.ngrams(input_list,n=4)
nltk.ngrams(input_list,n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngram_range=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Repost from my previous answer.

Yann Dubois

Here is a simple example using pure Python to generate any ngram:

>>> def ngrams(s, n=2, i=0):
...     while len(s[i:i+n]) == n:
...         yield s[i:i+n]
...         i += 1
...
>>> txt = 'Python is one of the awesomest languages'

>>> unigram = ngrams(txt.split(), n=1)
>>> list(unigram)
[['Python'], ['is'], ['one'], ['of'], ['the'], ['awesomest'], ['languages']]

>>> bigram = ngrams(txt.split(), n=2)
>>> list(bigram)
[['Python', 'is'], ['is', 'one'], ['one', 'of'], ['of', 'the'], ['the', 'awesomest'], ['awesomest', 'languages']]

>>> trigram = ngrams(txt.split(), n=3)
>>> list(trigram)
[['Python', 'is', 'one'], ['is', 'one', 'of'], ['one', 'of', 'the'], ['of', 'the', 'awesomest'], ['the', 'awesomest', 'languages']]
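
If you then want frequency counts from this generator, one possibility (a small sketch) is to turn each yielded slice into a tuple, since lists cannot be used as Counter keys:

>>> from collections import Counter
>>> bigram_counts = Counter(tuple(g) for g in ngrams(txt.split(), n=2))
>>> bigram_counts[('Python', 'is')]
1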
Aziz Alto

OK, so since you asked for an NLTK solution, this might not be exactly what you were looking for, but have you considered TextBlob? It has an NLTK backend, but a simpler syntax. It would look something like this:

from textblob import TextBlob

text = "Paste your text or text-containing variable here" 
blob = TextBlob(text)
ngram_var = blob.ngrams(n=3)
print(ngram_var)

Output:
[WordList(['Paste', 'your', 'text']), WordList(['your', 'text', 'or']), WordList(['text', 'or', 'text-containing']), WordList(['or', 'text-containing', 'variable']), WordList(['text-containing', 'variable', 'here'])]

You would of course still need to use Counter or some other method to add a count per ngram.
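
For instance, a small sketch continuing from the snippet above: each n-gram comes back as a WordList (a list subclass), so converting it to a tuple lets it serve as a Counter key:

from collections import Counter

# count the 3-grams produced by TextBlob; tuples are hashable, WordLists are not
ngram_counts = Counter(tuple(ng) for ng in blob.ngrams(n=3))
print(ngram_counts.most_common(5))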

However, the fastest approach (by far) I have been able to find for both creating any n-gram you'd like and also counting them in a single function stems from this post from 2012 and uses itertools. It's great.
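
As a rough illustration of that idea (not necessarily the exact code from the linked post), an itertools-based sketch that builds and counts n-grams in one pass might look like this:

from collections import Counter
from itertools import islice

def count_ngrams(tokens, n):
    """Count the n-grams of `tokens` in a single pass."""
    shifted = (islice(tokens, i, None) for i in range(n))   # n staggered views of the tokens
    return Counter(zip(*shifted))

print(count_ngrams('this is a test this is'.split(), 2).most_common(2))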

Montmons

The answer from @hellpanderr above is correct, but not efficient for a very large corpus (I faced difficulties with ~650K documents). The code slows down considerably every time the frequencies are updated, because of the expensive dictionary lookups as the counter grows. So you need an additional buffer variable to cache the frequencies Counter from @hellpanderr's answer. Instead of doing a key lookup in the very large frequencies dictionary every time a new document is processed, you add the counts to a temporary, smaller Counter; then, every so many iterations, that buffer is merged into the global frequencies. This is much faster because the lookup into the huge dictionary happens far less often.

import os
from collections import Counter

import nltk
from nltk.util import ngrams

corpus = []
path = '.'
for i in next(os.walk(path))[2]:
    if i.endswith('.txt'):
        with open(os.path.join(path, i)) as fh:
            corpus.append(fh.read())

frequencies = Counter()   # global counts
f = Counter()             # small buffer, flushed periodically

for i in range(0, len(corpus)):
    token = nltk.word_tokenize(corpus[i])
    bigrams = ngrams(token, 2)
    f += Counter(bigrams)
    if i % 10000 == 0:
        # merge the buffer into the global counter and clear it every 10000 docs
        frequencies += f
        f = Counter()

frequencies += f          # flush whatever is left in the buffer
Agung Dewandaru

Maybe it helps; see the link.

import spacy

nlp_en = spacy.load("en_core_web_sm")
doc = nlp_en("Hi How are you ? i am fine and you")   # build a Doc from your text
tokens = [x.text for x in doc]                       # list of token strings
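
From those tokens you could then build and count bigrams without any further libraries; a minimal sketch:

from collections import Counter

bigrams = list(zip(tokens, tokens[1:]))   # pair each token with its successor
print(Counter(bigrams).most_common(5))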
madjardi