Merge generator objects to calculate frequency in NLTK

Question

I am trying to count the frequency of various n-grams using the ngrams and FreqDist functions in NLTK. Because ngrams returns a generator object, I would like to merge the output for each n before calculating frequencies. However, I am having trouble merging the generator objects.

I have tried itertools.chain, which returned an itertools.chain object rather than a merged generator. I finally settled on permutations, but having to unpack the objects afterwards seems redundant.

The working code thus far is:

import nltk
from nltk import word_tokenize, pos_tag
from nltk.collocations import *
from itertools import *
from nltk.util import ngrams
import re
corpus = 'testing sentences to see if if if this works'
token = word_tokenize(corpus)
unigrams = ngrams(token,1)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)


perms = list(permutations([unigrams,bigrams,trigrams]))
fdist = nltk.FreqDist(perms)
for x,y in fdist.items():
    for k in x:
        for v in k:
            words = '_'.join(v)
            print(words, y)

As you can see in the results, FreqDist is not counting the words from the individual generator objects properly: each one has a frequency of 1. Is there a more Pythonic way to do this properly?

score 7 · Accepted Answer · answered Sep 27 '17 at 15:28

Use everygrams; it returns all n-grams for a given range of n.

>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'
>>> everygrams(corpus.split(), 1, 3)
<generator object everygrams at 0x7f4e272e9730>
>>> list(everygrams(corpus.split(), 1, 3))
[('testing',), ('sentences',), ('to',), ('see',), ('if',), ('if',), ('if',), ('this',), ('works',), ('testing', 'sentences'), ('sentences', 'to'), ('to', 'see'), ('see', 'if'), ('if', 'if'), ('if', 'if'), ('if', 'this'), ('this', 'works'), ('testing', 'sentences', 'to'), ('sentences', 'to', 'see'), ('to', 'see', 'if'), ('see', 'if', 'if'), ('if', 'if', 'if'), ('if', 'if', 'this'), ('if', 'this', 'works')]

To combine the counting of different orders of ngrams:

>>> from nltk import everygrams
>>> from nltk import FreqDist
>>> corpus = 'testing sentences to see if if if this works'.split()
>>> fd = FreqDist(everygrams(corpus, 1, 3))
>>> fd
FreqDist({('if',): 3, ('if', 'if'): 2, ('to', 'see'): 1, ('sentences', 'to', 'see'): 1, ('if', 'this'): 1, ('to', 'see', 'if'): 1, ('works',): 1, ('testing', 'sentences', 'to'): 1, ('sentences', 'to'): 1, ('sentences',): 1, ...})
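
To print the counts in the word_word form the question was after, the tuples can be joined once the frequencies are known. The sketch below avoids the NLTK dependency so it runs anywhere: `all_ngrams` is a hypothetical stand-in for `everygrams`, and `collections.Counter` stands in for `FreqDist`. Both substitutions are assumptions made for illustration, not the library's own code.

```python
from collections import Counter


def all_ngrams(tokens, min_len, max_len):
    """Hypothetical stand-in for nltk.everygrams: yield every n-gram
    (as a tuple) for each n from min_len to max_len, inclusive."""
    for n in range(min_len, max_len + 1):
        for start in range(len(tokens) - n + 1):
            yield tuple(tokens[start:start + n])


tokens = 'testing sentences to see if if if this works'.split()
counts = Counter(all_ngrams(tokens, 1, 3))

# Join each tuple with underscores, most frequent n-grams first.
for gram, freq in counts.most_common(2):
    print('_'.join(gram), freq)
# prints "if 3" and then "if_if 2"
```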

Alternatively, FreqDist is essentially a collections.Counter subclass, so you can combine frequency distributions by adding them, just as you would counters:

>>> from collections import Counter
>>> x = Counter([1,2,3,4,4,5,5,5])
>>> y = Counter([1,1,1,2,2])
>>> x + y
Counter({1: 4, 2: 3, 5: 3, 4: 2, 3: 1})

>>> from nltk import FreqDist
>>> FreqDist(['a', 'a', 'b'])
FreqDist({'a': 2, 'b': 1})
>>> a = FreqDist(['a', 'a', 'b'])
>>> b = FreqDist(['b', 'b', 'c', 'd', 'e'])
>>> a + b
FreqDist({'b': 3, 'a': 2, 'c': 1, 'e': 1, 'd': 1})
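
Applied to the question, this means each n-gram order can be counted separately and the results added together afterwards. A minimal sketch using only the standard library: `ngram_tuples` is a hypothetical helper standing in for `nltk.util.ngrams`, and `Counter` stands in for `FreqDist`.

```python
from collections import Counter


def ngram_tuples(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams: lazily yield n-grams."""
    return zip(*(tokens[i:] for i in range(n)))


tokens = 'testing sentences to see if if if this works'.split()

# One counter per n-gram order.
per_order = [Counter(ngram_tuples(tokens, n)) for n in (1, 2, 3)]

# sum() needs an empty Counter as the start value, otherwise it starts from 0.
combined = sum(per_order, Counter())

print(combined[('if',)], combined[('if', 'if')], combined[('if', 'if', 'if')])  # 3 2 1
```

Note that `+` builds a new counter at every step; for many counters, updating a single one in place is likely cheaper.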

score 2 · Answer 2 · answered Sep 27 '17 at 16:44

Alvas is right, nltk.everygrams is the perfect tool for this job. But merging several iterators is really not that hard, nor that uncommon, so you should know how to do it. The key is that any iterator can be converted to a list, but it's best to do that only once:

Make a list out of several iterators

Just use lists (simple but inefficient)

allgrams = list(unigrams) + list(bigrams) + list(trigrams)

Or build a single list, properly

allgrams = list(unigrams)
allgrams.extend(bigrams)
allgrams.extend(trigrams)

Or use itertools.chain(), then make a list

import itertools  # the question's "from itertools import *" does not bind the name itertools
allgrams = list(itertools.chain(unigrams, bigrams, trigrams))

All of the above produce identical results, as long as you don't try to reuse the iterators unigrams etc. They are exhausted after one pass, so you need to redefine them between examples.
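
The reuse caveat is easy to see in isolation. The sketch below uses a plain generator expression rather than the NLTK iterators. That substitution is an assumption for the sake of a dependency-free example, but any one-shot iterator behaves the same way.

```python
tokens = 'if if if this works'.split()

# A generator can be walked only once.
unigrams = ((tok,) for tok in tokens)

first_pass = list(unigrams)   # consumes the generator
second_pass = list(unigrams)  # nothing left to yield

print(len(first_pass), len(second_pass))  # 5 0

# Materialise once and reuse the list if several passes are needed.
unigram_list = [(tok,) for tok in tokens]
print(len(unigram_list), len(list(unigram_list)))  # 5 5
```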

Use the iterators themselves

Don't fight iterators, learn to work with them. Many Python functions accept them instead of lists, saving you much space and time.

You could form a single iterator and pass it to nltk.FreqDist():

fdist = nltk.FreqDist(itertools.chain(unigrams, bigrams, trigrams))
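
What `itertools.chain` returns is itself an iterator (the "itertools object" from the question); nothing is merged until something consumes it. A small sketch, with `collections.Counter` standing in for `FreqDist` and simple generators standing in for the NLTK n-gram iterators (both are assumptions for illustration):

```python
import itertools
from collections import Counter

tokens = 'if if if this works'.split()
unigrams = ((tok,) for tok in tokens)
bigrams = zip(tokens, tokens[1:])

merged = itertools.chain(unigrams, bigrams)
# merged is a lazy iterator: it supports next() but holds no items yet.
print(hasattr(merged, '__next__'))  # True

# Counter pulls items from the chain one by one; no intermediate list.
counts = Counter(merged)
print(counts[('if',)], counts[('if', 'if')])  # 3 2
```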

You can work with multiple iterators. FreqDist, like Counter, has an update() method you can use to count things incrementally:
```
fdist = nltk.FreqDist(unigrams)
fdist.update(bigrams)
fdist.update(trigrams)
```
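
The same `update()` pattern extends to counting over many sentences without keeping all the n-grams in memory. A sketch under the assumption that sentences arrive one at a time: `Counter` stands in for `FreqDist`, and `ngram_tuples` is a hypothetical replacement for `nltk.util.ngrams`.

```python
from collections import Counter


def ngram_tuples(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams."""
    return zip(*(tokens[i:] for i in range(n)))


sentences = [
    'testing sentences to see if if if this works',
    'see if this works too',
]

counts = Counter()
for sentence in sentences:
    tokens = sentence.split()
    for n in (1, 2, 3):
        # update() consumes the iterator and adds to the running totals.
        counts.update(ngram_tuples(tokens, n))

print(counts[('if',)], counts[('this', 'works')])  # 4 2
```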

Although I accepted the answer above because it was the correct tool for this job, thank you for the extremely relevant information and explanation. I was struggling with generators, and now I have a much clearer idea of the different methods to use and join them. - owwoow14, Sep 28 '17 at 07:32
