I want to match a particular set of nested brackets in the output of a grammatical parser (the Stanford Parser), as shown below.
(ROOT (S (NP (PRP He)) (VP (VBD gave) (NP (PRP me)) (NP (DT a) (NN pen))) (. .)))
(ROOT (S (NP (PRP He)) (VP (VBD said) (SBAR (IN that) (S (NP (PRP he)) (VP (VBD was) (ADJP (JJ hungry)))))) (. .)))
(ROOT (S (NP (PRP I)) (VP (VBD wrote) (NP (PRP him)) (NP (DT a) (JJ long) (NN letter))) (. .)))
(ROOT (S (NP (PRP He)) (VP (VBD provided) (NP (DT the) (JJ old) (NN bagger)) (NP (NP (DT a) (NN lot)) (PP (IN of) (NP (NN food))))) (. .)))
So I want to match everything within the (VP...). But there are conditions:
(1) It should have one (VBD..) followed by two (NP..). The VBD is not a problem.
(2) The two NP brackets are the problem. The structure of an NP bracket is not predictable. The only predictable things are the NP label and the nested brackets, like this: (NP bla bla bla).
So I want to capture each NP together with all the nested brackets inside it. The regex below matches what I want (in these examples at least), but it does not define the (NP bla bla bla) part. It is only half finished: it does not yet contain the piece I am looking for, i.e. each NP with all of its recursive bracket sub-nodes.
\(VP\s+\(V\w+([^()]+|(?<Level>\()|(?<-Level>\)))+(?(Level)(?!))\)
There is an explanation of Balancing Group Definitions here that covers nested brackets, but it does not offer a solution to my problem.
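To make the requirement concrete, here is a minimal sketch of the extraction I am after. It is written in Python and uses a plain depth counter, not a .NET balancing-group regex, so it only illustrates the expected result and is not the regex solution itself. The function names (read_subtree, children, ditransitive_vps) are made up for illustration, and the sketch assumes that condition (1) means the direct children of the VP are exactly VBD, NP, NP.

```python
def read_subtree(text, start):
    """Return the index just past the bracket that closes the one at text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
    raise ValueError("unbalanced brackets")


def children(node):
    """Split '(LABEL child child ...)' into its label and its top-level bracketed children."""
    inner = node[1:-1]                    # drop the outer pair of brackets
    label = inner.split(None, 1)[0]       # first token is the node label
    kids = []
    i = inner.find("(")
    while i != -1:
        end = read_subtree(inner, i)      # skip over the whole nested child
        kids.append(inner[i:end])
        i = inner.find("(", end)
    return label, kids


def ditransitive_vps(tree):
    """Yield (vp, np1, np2) for every VP whose direct children are exactly VBD, NP, NP.

    Assumption: 'one VBD and two NPs afterwards' means exactly these three children.
    """
    i = tree.find("(VP ")
    while i != -1:
        vp = tree[i:read_subtree(tree, i)]
        _, kids = children(vp)
        labels = [children(k)[0] for k in kids]
        if labels == ["VBD", "NP", "NP"]:
            yield vp, kids[1], kids[2]
        i = tree.find("(VP ", i + 1)      # continue, so nested VPs are checked too


sentence = "(ROOT (S (NP (PRP He)) (VP (VBD gave) (NP (PRP me)) (NP (DT a) (NN pen))) (. .)))"
for vp, np1, np2 in ditransitive_vps(sentence):
    print(np1)  # (NP (PRP me))
    print(np2)  # (NP (DT a) (NN pen))
```

With the second sample sentence (the one with SBAR) this yields nothing, because neither VP has two NP children. What I would like is a regex that produces the same captures.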