NLTK nltk.ConditionalFreqDist - Plot ngrams

Question

Here are two examples: one that works and is derived from https://www.nltk.org/book/ch02.html, and another that does not. The first example plots single-word frequencies, here for ['america', 'citizen']. The second is a modified version (evidently incorrect) that attempts to plot the frequency of the bigram ['american citizen']. I would like to plot n-gram frequencies, such as for a bigram like ['american citizen'].

[Figures: Plot Example 1; Plot Example 2 (failed)]

import nltk
from nltk.book import *
import matplotlib.pyplot as plt
from nltk.corpus import inaugural
inaugural.fileids()
plt.ion() # turns interactive mode on
[fileid[:4] for fileid in inaugural.fileids()]



############- this works ####
cfd = nltk.ConditionalFreqDist(
     (target, fileid[:4])
     for fileid in inaugural.fileids()
     for w in inaugural.words(fileid)
     for target in ['america', 'citizen']
     if w.lower().startswith(target)) 
ax = plt.axes()
cfd.plot()

############- this does not work ####

cfd = nltk.ConditionalFreqDist(
     (target, fileid[:4])
     for fileid in inaugural.fileids()
     for w in inaugural.words(fileid)
     for target in ['american citizen']
     if w.lower().startswith(target)) 
ax = plt.axes()
cfd.plot()

Thanks for the first question. Looks OK. Maybe attaching the plot will attract more people to try to help. – Łukasz Ślusarczyk, Feb 28 '20 at 19:35
Welcome to Stack Overflow! It would be great if you could format your question properly (the code won't run as-is) and provide the returned results or the stack trace of the error you are getting. Please read up on [Minimal, Workable Examples](https://stackoverflow.com/help/minimal-reproducible-example) if unsure. – sophros, Feb 29 '20 at 11:39

sophros · Accepted Answer · 2020-03-02T16:30:44.583

It seems to me that you are trying to find 'american citizen', a collocation made up of 2 words, by searching among single words. This is bound to fail, because each `w` is a single token and can never start with a two-word string. You would have to check for such a bigram among pairs of consecutive words, and for that you need to zip the list of words with a copy of itself shifted by 1 word.
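To make the failure concrete, here is a small sketch on a hand-made token list (the `tokens` list is made up for illustration and is not taken from the corpus). It assumes, as in the question, that matching is done by lowercase prefix. No single token starts with a string that contains a space, whereas adjacent pairs can be tested word by word.

```python
# Toy token list standing in for inaugural.words(fileid) (illustrative only).
tokens = ['My', 'fellow', 'American', 'citizens', ',', 'America', 'is', 'strong']

# Single-token test from the question: never true for a two-word target.
single_hits = [w for w in tokens if w.lower().startswith('american citizen')]

# Pair test: line each token up with its successor, then test both words.
pairs = list(zip(tokens, tokens[1:]))
pair_hits = [
    (w1, w2)
    for w1, w2 in pairs
    if w1.lower().startswith('american') and w2.lower().startswith('citizen')
]

print(single_hits)  # []
print(pair_hits)    # [('American', 'citizens')]
```

Slicing with `tokens[1:]` copies the list, which is fine for a toy example; for a large corpus view an iterator-based shift avoids the copy.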

The key difference from your code is shown here (you can add more collocations, as pairs of words, to the list in the last `for`):

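import nltk
import matplotlib.pyplot as plt
from nltk.corpus import inaugural


def zip2(lst):
    """Pair each element of lst with its successor: (w0, w1), (w1, w2), ..."""
    ilst = iter(lst)
    next(ilst, None)  # drop the first element (no error if lst is empty)
    return zip(lst, ilst)


# Conditions are the bigrams; samples are the years taken from the file IDs.
cfd = nltk.ConditionalFreqDist(
    (t1 + ' ' + t2, fileid[:4])
    for fileid in inaugural.fileids()
    for w1, w2 in zip2(inaugural.words(fileid))
    for t1, t2 in [('american', 'citizen')]
    if w1.lower().startswith(t1) and w2.lower().startswith(t2)
)
ax = plt.axes()
cfd.plot()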
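Since the question asks about n-grams in general, here is a rough sketch of how the same idea might extend beyond pairs. It uses only the standard library and a made-up miniature corpus; `sliding_ngrams`, `ngram_pairs` and `docs` are hypothetical names introduced for this illustration. The pairs it yields have the same (condition, sample) shape that is passed to `nltk.ConditionalFreqDist` above, so they could presumably be fed to it directly.

```python
from collections import Counter
from itertools import islice


def sliding_ngrams(words, n):
    """Yield tuples of n consecutive items from `words` (hypothetical helper)."""
    words = list(words)
    # zip over n views of the list, each starting one position further along.
    return zip(*(islice(words, i, None) for i in range(n)))


def ngram_pairs(docs, targets):
    """Yield (condition, label) for every n-gram matching a target by prefix.

    docs: mapping label -> iterable of tokens (stand-in for the corpus files).
    targets: tuples of lowercase prefixes, e.g. [('american', 'citizen')].
    """
    for label, words in docs.items():
        words = list(words)
        for target in targets:
            for gram in sliding_ngrams(words, len(target)):
                if all(w.lower().startswith(t) for w, t in zip(gram, target)):
                    yield ' '.join(target), label


# Made-up miniature "corpus" keyed by year, for illustration only.
docs = {
    '1789': ['Fellow', 'citizens', 'of', 'the', 'Senate'],
    '1901': ['My', 'fellow', 'American', 'citizens', 'and', 'American', 'Citizen'],
}
targets = [('american', 'citizen'), ('my', 'fellow', 'american')]
pairs = list(ngram_pairs(docs, targets))
counts = Counter(pairs)
print(counts)
```

NLTK also ships `nltk.bigrams` and `nltk.ngrams`, which should be usable in place of the hand-rolled `sliding_ngrams` helper when working with the real corpus.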

edited Mar 02 '20 at 16:30

answered Feb 29 '20 at 12:14

sophros


Hi Sophros, thanks for your thoughtful response. I understand the problem and how you suggest addressing it. For some reason, in your example the line `for w1, w2 in zip(inaugural.words(fileid), next(inaugural.words(fileid)))` does not produce a valid iterator. The issue seems to be related to the `next()` part, as I get `File "", line 14, in for w1, w2 in zip(inaugural.words(fileid), next(inaugural.words(fileid))) TypeError: 'StreamBackedCorpusView' object is not an iterator`. The code without the `next()` runs OK. Am I missing something? – drstvnm Mar 02 '20 at 02:50
I suppose I could brute-force the idea and shift the list like this: `list1 = [[], 'My', 'fellow', 'citizens']; list2 = ['My', 'fellow', 'citizens']; dlist = zip(list1, list2); list(dlist)`, which gives `Out[11]: [([], 'My'), ('My', 'fellow'), ('fellow', 'citizens')]`, but your solution seems more appealing if I can make it work. – drstvnm Mar 02 '20 at 03:05
I missed one typo when correcting it in the answer. Try `next(iter(inaugural.words(fileid)))`, as in the now-corrected answer. – sophros Mar 02 '20 at 04:12
Thanks again. That does not quite do what I think you are suggesting. If you look at this little example, you see that `zip()` then iterates over the characters of the first string rather than over words: `list1 = ['My', 'fellow', 'citizens']; dlist = zip(list1, next(iter(list1))); list(dlist)` gives `Out[12]: [('My', 'M'), ('fellow', 'y')]`. – drstvnm Mar 02 '20 at 15:16
@drstvnm: indeed. This was an error on my side. Please see the corrected answer, which I checked. – sophros Mar 02 '20 at 16:32
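For anyone following this thread, here is a minimal sketch of the difference between the two attempts, using the same three-word list as in the comment above (the variable names `wrong`, `shifted` and `right` are just for illustration). `next(iter(...))` returns the first element, not a shifted sequence, so the iterator itself has to be advanced and then passed to `zip()`.

```python
words = ['My', 'fellow', 'citizens']

# next(iter(words)) returns the first ELEMENT, the string 'My',
# so zip() walks over its characters.
wrong = list(zip(words, next(iter(words))))

# Advancing an iterator and then zipping the iterator itself
# pairs each word with its successor.
shifted = iter(words)
next(shifted, None)  # skip the first word; the default avoids StopIteration on empty input
right = list(zip(words, shifted))

print(wrong)  # [('My', 'M'), ('fellow', 'y')]
print(right)  # [('My', 'fellow'), ('fellow', 'citizens')]
```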
