I am trying to count the frequency of various n-grams using ngrams and FreqDist from nltk.
Because ngrams returns a generator object, I would like to merge the output of each ngrams call before calculating frequencies.
However, I am having trouble merging the various generator objects.
I have tried itertools.chain, which created an itertools object rather than merging the generators.
I have finally settled on permutations, but having to parse the objects afterwards seems redundant.
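For reference, the object that itertools.chain returns is itself a lazy iterator over the items of all its inputs, so it may already be the merge being asked for. A minimal sketch, using plain generators as stand-ins for the ngrams output (the names gen_a and gen_b are made up for illustration):

```python
from itertools import chain

# Stand-in generators; in the question these would be the ngrams(...) results.
gen_a = (t for t in [("a",), ("b",)])
gen_b = (t for t in [("a", "b")])

merged = chain(gen_a, gen_b)  # a chain object; nothing has been consumed yet
items = list(merged)          # iterating it yields every item, in order
print(items)                  # [('a',), ('b',), ('a', 'b')]
```

Like the generators it wraps, the chain object can only be iterated once.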
The working code thus far is:
import nltk
from nltk import word_tokenize, pos_tag
from nltk.collocations import *
from itertools import *
from nltk.util import ngrams
import re
corpus = 'testing sentences to see if if if this works'
token = word_tokenize(corpus)
unigrams = ngrams(token,1)
bigrams = ngrams(token,2)
trigrams = ngrams(token,3)
perms = list(permutations([unigrams,bigrams,trigrams]))
fdist = nltk.FreqDist(perms)
for x,y in fdist.items():
    for k in x:
        for v in k:
            words = '_'.join(v)
            print words, y
As you can see in the results, FreqDist is not counting the n-grams from the individual generator objects properly: each one has a frequency of 1. Is there a more Pythonic way to do this properly?
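One likely explanation is that permutations produces orderings of the three generator objects themselves, so the counter sees six distinct tuples of generators rather than the n-grams inside them. The sketch below chains the generators into a single stream and counts that instead. It is not tested against nltk: simple_ngrams is a hypothetical stand-in for nltk.util.ngrams, str.split stands in for word_tokenize, and collections.Counter stands in for FreqDist (which is a Counter subclass in NLTK 3, so passing the same iterable to nltk.FreqDist should behave similarly).

```python
from collections import Counter
from itertools import chain


def simple_ngrams(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams: yields tuples of n tokens."""
    return zip(*(tokens[i:] for i in range(n)))


corpus = 'testing sentences to see if if if this works'
tokens = corpus.split()  # whitespace split stands in for word_tokenize

# Merge the unigram, bigram and trigram generators into one stream of tuples.
all_ngrams = chain.from_iterable(simple_ngrams(tokens, n) for n in (1, 2, 3))

# Count the n-gram tuples themselves, not the generator objects.
counts = Counter(all_ngrams)

for gram, freq in counts.items():
    print('_'.join(gram), freq)
```

With this corpus, the unigram "if" is counted three times and the bigram "if_if" twice, while every other n-gram is counted once.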