Firstly, try to use (i) namespaces and (ii) unequivocal variable names, e.g.:
>>> from nltk import PCFG
>>> from nltk.parse import ViterbiParser
>>> import urllib.request
>>> response = urllib.request.urlopen('https://raw.githubusercontent.com/salmanahmad/6.863/master/Labs/Assignment5/Code/wsjp.cfg')
>>> wsjp = response.read().decode('utf8')
>>> grammar = PCFG.fromstring(wsjp)
>>> parser = ViterbiParser(grammar)
>>> list(parser.parse('turn off the lights'.split()))
[ProbabilisticTree('S', [ProbabilisticTree('VP', [ProbabilisticTree('VB', ['turn']) (p=0.002082678), ProbabilisticTree('PRT', [ProbabilisticTree('RP', ['off']) (p=0.1089101771)]) (p=0.10768769667270556), ProbabilisticTree('NP', [ProbabilisticTree('DT', ['the']) (p=0.7396712852), ProbabilisticTree('NNS', ['lights']) (p=4.61672e-05)]) (p=4.4236397464693323e-07)]) (p=1.0999324002161311e-13)]) (p=2.5385077255727538e-14)]
If we check whether the grammar covers every word of the longer sentence:
>>> grammar.check_coverage('please turn off the lights'.split())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/dist-packages/nltk/grammar.py", line 631, in check_coverage
    "input words: %r." % missing)
ValueError: Grammar does not cover some of the input words: "'please'".
To resolve the unknown-word issue, there are several options:
Use wildcard non-terminal nodes to replace the unknown words. Find some way to replace the words that the grammar doesn't cover (as reported by check_coverage()) with the wildcard, then parse the sentence with the wildcard. This will usually decrease the parser's accuracy unless you have specifically trained the PCFG with a grammar that handles unknown words and the wildcard is a superset of the unknown words.
Go back to the grammar production file that you had before learning the PCFG with learn_pcfg.py and add all possible words to the terminal productions.
Add the unknown words to your PCFG grammar and then renormalize the weights, giving very small weights to the unknown words (you can also try smarter smoothing/interpolation techniques).
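To make the wildcard and renormalization ideas a bit more concrete, here is a minimal pure-Python sketch on toy data. It deliberately does not touch NLTK or wsjp.cfg: the function names, the "UNK" token, the epsilon value and the toy probabilities are all made up for illustration, and wiring this into an actual PCFG is left to you.

```python
def replace_unknown(tokens, vocabulary, wildcard="UNK"):
    """Swap every token the grammar does not cover for a wildcard token.

    `vocabulary` is assumed to be the set of terminals in the grammar;
    "UNK" is a hypothetical wildcard that the grammar would need to know.
    """
    return [tok if tok in vocabulary else wildcard for tok in tokens]


def add_unknown_words(lexicon, unknown_words, epsilon=1e-6):
    """Return a new {word: prob} dict that also covers `unknown_words`.

    `lexicon` holds the lexical rules of ONE preterminal. Each unknown
    word gets the small raw weight `epsilon`, then all weights are
    rescaled so that they sum to 1 again.
    """
    weights = dict(lexicon)
    for word in unknown_words:
        weights.setdefault(word, epsilon)  # never overwrite a known word
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Toy data: the numbers are invented, not taken from wsjp.cfg.
known = {"turn", "off", "the", "lights"}
sentence = "please turn off the lights".split()
masked = replace_unknown(sentence, known)

uh_rules = {"yes": 0.7, "oh": 0.3}  # hypothetical rules for one tag
smoothed = add_unknown_words(uh_rules, ["please"])
```

Here masked has the wildcard in place of 'please', and smoothed is a proper distribution again in which 'please' carries only a tiny share of the probability mass. Plain epsilon-weighting is the crudest choice; the smarter smoothing techniques mentioned above would replace the body of add_unknown_words.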
Since this is a homework question, I will not give the answer with the full code, but the hints above should be enough to resolve the problem.