I am trying to extract all three-word noun phrases from a Stanford parse tree with POS tags. Basically, anything that looks like:
(NP (TAG WORD) (TAG WORD) (TAG WORD))
Or:
(NP (TAG WORD) (TAG (TAG WORD) (TAG WORD)))
This is what a parse tree can look like:
(ROOT (SQ (VBZ Is) (NP (DT this)) (NP (DT an) (NN asthma) (NN attack)) (. ?)))
When I use this regex, it extracts the correct three-word noun phrase:
# one \([^()]+ [^()]+\) group per (TAG WORD) leaf, three leaves in total
threeWordNounPhrases = full.scan(/\(NP \([^()]+ [^()]+\) \([^()]+ [^()]+\) \([^()]+ [^()]+\)\)/)
# => ["(NP (DT an) (NN asthma) (NN attack))"]
However, this does not work when the noun phrase contains nested phrases, since [^()]+ cannot match across nested parentheses. For example:
(ROOT (SQ (NNP Should) (NP (PRP I)) (VP (VB watch) (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones)))) ) (. ?)))
This should return:
(NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones))))
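For reference, one direction that might work instead of a flat regex is to parse the bracketed string into nested arrays and count the words under each NP node. The following is only a rough sketch under some assumptions: parse_tree, word_count, render_tree and noun_phrases are hypothetical helper names I made up (they are not part of the Stanford tools or any library), and it assumes tags and words never contain parentheses or whitespace.

```ruby
# Turn "(NP (DT an) (NN asthma))" into ["NP", ["DT", "an"], ["NN", "asthma"]].
def parse_tree(text)
  tokens = text.scan(/[()]|[^\s()]+/)   # parentheses or runs of other characters
  stack = [[]]
  tokens.each do |tok|
    case tok
    when "(" then stack.push([])        # start a new node
    when ")"                            # close the node, attach it to its parent
      node = stack.pop
      stack.last << node
    else
      stack.last << tok                 # tag or word
    end
  end
  stack.first.first
end

# Number of words (string children after the tag) under a node.
def word_count(node)
  node.drop(1).sum { |child| child.is_a?(Array) ? word_count(child) : 1 }
end

# Rebuild the bracketed form with single spaces.
def render_tree(node)
  "(" + node.map { |child| child.is_a?(Array) ? render_tree(child) : child }.join(" ") + ")"
end

# Collect every NP subtree that spans exactly n words.
def noun_phrases(node, n = 3, found = [])
  found << render_tree(node) if node.first == "NP" && word_count(node) == n
  node.drop(1).each { |child| noun_phrases(child, n, found) if child.is_a?(Array) }
  found
end

flat   = "(ROOT (SQ (VBZ Is) (NP (DT this)) (NP (DT an) (NN asthma) (NN attack)) (. ?)))"
nested = "(ROOT (SQ (NNP Should) (NP (PRP I)) (VP (VB watch) (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones)))) ) (. ?)))"

flat_result   = noun_phrases(parse_tree(flat))
# => ["(NP (DT an) (NN asthma) (NN attack))"]
nested_result = noun_phrases(parse_tree(nested))
# => ["(NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones))))"]
```

This counts words rather than matching the exact shapes listed at the top, so it would also return three-word NPs with other nesting patterns, which may or may not be what I want.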