Regex to extract Noun Phrases from a Part of Speech parse tree?

Question

I am trying to extract all three word noun phrases from a Stanford POS Parse Tree. Basically, anything that looks like:

(NP (TAG WORD) (TAG WORD) (TAG WORD))

Or:

(NP (TAG WORD) (TAG (TAG WORD) (TAG WORD)))

This is what a parse tree can look like:

(ROOT (SQ (VBZ Is) (NP (DT this)) (NP (DT an) (NN asthma) (NN attack)) (. ?)))

When I do this regex, it extracts the correct 3 word noun phrase:

threeWordNounPhrases = full.scan(/\(NP \([^()]+ [^()]+\) \([^()]+ [^()]+\)\)/)
# => "(NP (DT an) (NN asthma) (NN attack))"

However, this does not work for something like:

(ROOT (SQ (NNP Should) (NP (PRP I)) (VP (VB watch) (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones)))) ) (. ?)))

Which should return:

(NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones))))

Don't use regular expressions for free-form, unstructured content. Have you considered using an NLP gem, [such as `treat`](https://github.com/louismullie/treat/wiki/Quick-Tour)? — yurisich, Mar 08 '16 at 22:17
There also exist Ruby bindings for the Stanford Core NLP package: https://github.com/louismullie/stanford-core-nlp — Jordan Running, Mar 08 '16 at 22:41
Not solvable reliably with regex. Parse (or otherwise get) the data as an actual tree. — ivan_pozdeev, Mar 09 '16 at 05:26
Possible duplicate of [Are there particular cases where native text manipulation is more desirable than regex?](http://stackoverflow.com/questions/1038186/are-there-particular-cases-where-native-text-manipulation-is-more-desirable-than) — ivan_pozdeev, Mar 09 '16 at 05:28
Or http://stackoverflow.com/questions/14708047/how-to-extract-the-noun-phrases-using-open-nlps-chunking-parser — ivan_pozdeev, Mar 09 '16 at 05:33

score 2 · Answer 1 · answered Mar 09 '16 at 01:21

Specifically for three words, it is possible, but not pretty. For N words, the complexity of the regexp rises. Note that this is just for fun (and regexp/Oniguruma education); in reality, I'd suggest to go with what everyone else says: use a tree parsing library and manipulate the tree.

str = "(ROOT (SQ (NNP Should) (NP (PRP I)) (VP (VB watch) (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones)))) ) (. ?)))"

re = /
  (?<tag>
    [A-Z]+
  ){0}

  (?<word>
    \( \g<tag> \s
    (?:
      [^()]+ |
      \g<word>
    )
    \)
  ){0}

  (?<word2>
    \g<word> \s \g<word> |
    \( \g<tag> \s \g<word2> \)
  ){0}

  (?<word3>
    \g<word> \s \g<word> \s \g<word> |
    \g<word2> \s \g<word> |
    \g<word> \s \g<word2> |
    \( \g<tag> \s \g<word3> \)
  ){0}
  \( NP \s \g<word3> \)
/x;

puts str[re]
# => (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones))))

score 0 · Answer 2 · answered Mar 08 '16 at 23:00

I don't see a way to use regular expressions unless are able to consider all possible structures. What you did works for simple cases, but as you found, it fails with deeper, nested structures. I see two options:

From the point where you encounter (NP in the text, read in additional characters. Keep a running tally of parenthesis. Add to it when you see (, subtract when you see ). When you get to zero, you've reached the end of the NP.
Parse the tree using rubytree. Extract all subtrees that are dominated by a node with label of NP. Convert the subtree back to string form by concatenating leaf nodes.

Regex to extract Noun Phrases from a Part of Speech parse tree?

2 Answers2