0

I want to match a particular set of nested brackets from a grammatical parser's output (named Stanford Parser) as below.

(ROOT (S (NP (PRP He)) (VP (VBD gave) (NP (PRP me)) (NP (DT a) (NN pen))) (. .)))
(ROOT (S (NP (PRP He)) (VP (VBD said) (SBAR (IN that) (S (NP (PRP he)) (VP (VBD was) (ADJP (JJ hungry)))))) (. .)))
(ROOT (S (NP (PRP I)) (VP (VBD wrote) (NP (PRP him)) (NP (DT a) (JJ long) (NN letter))) (. .)))
(ROOT (S (NP (PRP He)) (VP (VBD provided) (NP (DT the) (JJ old) (NN bagger)) (NP (NP (DT a) (NN lot)) (PP (IN of) (NP (NN food))))) (. .)))

So want to match everything within the (VP...). But there are conditions: (1) It should have 1 (VBD..)and two (NP..) afterwards. The VBD is not a problem.(2) Two sets of NP is the problem. The structure of an NP bracket is not predictable. The only thing predictable is NP and nested brackets like this (NP bla bla bla ). So I want to capture each NP, which involves combining nested brackets with NP. Below regex matches what I want (in this example at least), but it does not have (NP bla bla bla ) part defined. The half finished regex below does not contain this solution I seek, i.e. the NP part with all recursive bracket sub-nodes within it.

\(VP\s+\(V\w+([^()]+|(?<Level>\()|(?<-Level>\)))+(?(Level)(?!))\)

There is something about Balancing Group Definition here, that explains nesting brackets but it does not offer a solution for my problem.

Shakir
  • 343
  • 5
  • 23
  • What are expected results here? Only one indicated above? – Wiktor Stribiżew Apr 18 '17 at 15:44
  • I got 3 matches: `(VP (VBD gave) (NP (PRP me)) (NP (DT a) (NN pen)))`, `(VP (VBD wrote) (NP (PRP him)) (NP (DT a) (JJ long) (NN letter)))`, `(VP (VBD provided) (NP (DT the) (JJ old) (NN bagger)) (NP (NP (DT a) (NN lot)) (PP (IN of) (NP (NN food)))))` - are those expected? – Wiktor Stribiżew Apr 18 '17 at 15:59
  • I have edited the question hopefully now it is clear. – Shakir Apr 18 '17 at 17:03
  • I have edited it again so clarify that I could not manage to define a recursive regex that can match all NPs with a VP. The above example just matches a VP (all of it without considering 2 NPs being there or not there). – Shakir Apr 19 '17 at 11:09
  • Yeah. It is actually. These examples are very simple involving no nesting of, for example, (VP....(VP....)). So I was kind of hoping to pick up everything between top most (VP....) node and then check if two (NP...) nodes existed. Like it happens in html (e.g. HTMLAgilitypack). But that does not seem to be possible with regex. Unable to understand how can I define start and end of something like (VP....) when there can be sub-nodes with the same (VPs within VPs, and NPs within NPs). Looks impossible with regex. – Shakir Apr 19 '17 at 11:22
  • I have this regex - [`\(VP\s+\(VBD\s+[^()]+\)(?:\s+\((?:(?:(?NP)|[^()])|(?\()|(?<-l>\)))*(?(l)(?!))\))+(?<-n>){2}(?(n)(?!))\)`](http://regexstorm.net/tester?p=%5c%28VP%5cs%2b%5c%28VBD%5cs%2b%5b%5e%28%29%5d%2b%5c%29%28%3f%3a%5cs%2b%5c%28%28%3f%3a%28%3f%3a%28%3f%3cn%3eNP%29%7c%5b%5e%28%29%5d%29%7c%28%3f%3cl%3e%5c%28%29%7c%28%3f%3c-l%3e%5c%29%29%29*%28%3f%28l%29%28%3f!%29%29%5c%29%29%2b%28%3f%3c-n%3e%29%7b2%7d%28%3f%28n%29%28%3f!%29%29%5c%29&i=%28ROOT+%28S+%28NP+%28PRP+I%29%29+%28VP+%28VBD+wrote%29+%28NP+%28PRP+him%29%29+%28NP+%28DT+a%29+%28JJ+long%29+%28NN+letter%29%29%29+%28). Pls check. – Wiktor Stribiżew Apr 19 '17 at 11:25
  • Thanks. I will put it into a C# program and check how does it work. Because the online matching tool breaks at [this example](https://expirebox.com/download/4aa7168c2a64d3efa97b77c23197072e.html) with this regex. As I said, very messy data. – Shakir Apr 19 '17 at 11:40
  • Could you separate the part that matches (NP and everything within it recursively until the closing ')' ? I could learn from it and find a better solution as I (being a linguist) can do ti better myself and probably not good explaining to a non-linguist. – Shakir Apr 19 '17 at 11:45
  • Try with [`\(VP\s+\(VBD\s+[^()]+\)(?:\s+\((?>(?:(?NP)|[^()])|(?\()|(?<-l>\)))*(?(l)(?!))\))+(?<-n>){2}(?(n)(?!))\)`](http://regexstorm.net/tester?p=%5c%28VP%5cs%2b%5c%28VBD%5cs%2b%5b%5e%28%29%5d%2b%5c%29%28%3f%3a%5cs%2b%5c%28%28%3f%3e%28%3f%3a%28%3f%3cn%3eNP%29%7c%5b%5e%28%29%5d%29%7c%28%3f%3cl%3e%5c%28%29%7c%28%3f%3c-l%3e%5c%29%29%29*%28%3f%28l%29%28%3f!%29%29%5c%29%29%2b%28%3f%3c-n%3e%29%7b2%7d%28%3f%28n%29%28%3f!%29%29%5c%29&i=%28ROOT+%28S+%28NP+%28PRP+I%29%29+%28VP+%28VBD+wrote%29+%28NP+%28PRP+him%29%29+%28NP+%28DT+a%29+%28JJ+long%29+%28NN+letter%29%29%29+%28) – Wiktor Stribiżew Apr 19 '17 at 11:52
  • Apparently second last (?‌​!) (before {2} is not right). – Shakir Apr 19 '17 at 11:57
  • Why? It is right. It checks if the `()` are balanced inside `(VP...)`. – Wiktor Stribiżew Apr 19 '17 at 12:01
  • Well, Expresso (the .net regex build tool) and regex tester online both say there is an error in grouping. Thanks a lot for the help. I am gonna study it and see what can do to make it work. – Shakir Apr 19 '17 at 12:09
  • You definitely copied the regex from the comment above. See [Expresso screenshot](http://imgur.com/a/R25uY) - all is working well. Copy the regex from the regexstorm, SO adds rubbish chars into comments. – Wiktor Stribiżew Apr 19 '17 at 12:13
  • I was about to post, but there is another problem I have not yet asked for clarification: there is a match of `(VP (VBD implemented) (PP (IN on) (NP (NNP May) (CD 28) (, ,) (CD 2014))))` - should it be matched? The `NP` count is OK due to the presence of `NNP`. Do you need to check for 2 `NP` as **whole words** or not? – Wiktor Stribiżew Apr 19 '17 at 13:06
  • I see, there was problem with copying. The thing I am trying to match is called ditransitive verbs, something like this (VP.....(NP) (NP)). So a (PP..) is extra here. – Shakir Apr 19 '17 at 18:43
  • Just [replace `NP` with `\bNP\b`](http://regexstorm.net/tester?p=%5c%28VP%5cs%2b%5c%28VBD%5cs%2b%5b%5e%28%29%5d%2b%5c%29%28%3f%3a%5cs%2b%5c%28%28%3f%3e%28%3f%3a%28%3f%3cn%3e%5cbNP%5cb%29%7c%5b%5e%28%29%5d%29%7c%28%3f%3cl%3e%5c%28%29%7c%28%3f%3c-l%3e%5c%29%29%29*%28%3f%28l%29%28%3f!%29%29%5c%29%29%2b%28%3f%3c-n%3e%29%7b2%7d%28%3f%28n%29%28%3f!%29%29%5c%29&i=%28ROOT+%28S+%28NP+%28PRP+I%29%29+%28VP+%28VBD+wrote%29+%28NP+%28PRP+him%29%29+%28NP+%28DT+a%29+%28JJ+long%29+%28NN+letter%29%29%29+%28). – Wiktor Stribiżew Apr 19 '17 at 18:48
  • Does [this work](http://regexstorm.net/tester?p=%5c%28VP%5cs%2b%5c%28VBD%5cs%2b%5b%5e%28%29%5d%2b%5c%29%28%3f%3a%5cs%2b%5c%28%28%3f%3e%28%3f%3a%28%3f%3cn%3e%5cbNP%5cb%29%7c%5b%5e%28%29%5d%29%7c%28%3f%3cl%3e%5c%28%29%7c%28%3f%3c-l%3e%5c%29%29%29%2A%28%3f%28l%29%28%3f!%29%29%5c%29%29%2b%28%3f%3c-n%3e%29%7b2%7d%28%3f%28n%29%28%3f!%29%29%5c%29&i=%28ROOT+%28S+%28NP+%28PRP+I%29%29+%28VP+%28VBD+wrote%29+%28NP+%28PRP+him%29%29+%28NP+%28DT+a%29+%28JJ+long%29+%28NN+letter%29%29%29+%28)? – Wiktor Stribiżew Apr 21 '17 at 06:51
  • 1
    I have tried it, unfortunately there is a lot of noise (unwanted matches). I think there has a lot to be addressed. The real data is complex and I will have to use string splitting + regex; or seek an xml output which can be processed by something like HTMLAgilitypack. – Shakir Apr 22 '17 at 13:24

2 Answers2

0

Well, I'm not sure I really understood what exactly you wanted, but I'll give it a try. :)

\(VP.*\(V(\w{1,2}).*\(NP.*\){2}\) This matches 4 times with your given example and the one special case you wanted.

You might want to check out regexpal.com to check for yourself.

Edit: I used . (dot) a lot, you might want to be a little stricter.

Morfium
  • 241
  • 4
  • 7
0

Nope, sorry. Regex is incredibly useful but you're asking for something that it can't do. Regex is a "deterministic finite automaton" and doesn't have the ability to do counting: https://en.wikipedia.org/wiki/Deterministic_finite_automaton

So, what you probably want is a simple recursive descent parser that will let you match up parentheses recursively. It's probably less work than what you've expended trying to make regex work, especially for as simple a grammar as you have. For a description and example you can start here: https://en.wikipedia.org/wiki/Recursive_descent_parser

(Hey, what do you know! Those computer science classes turned out to be useful!)

Ray Fischer
  • 936
  • 7
  • 9
  • I was more wondering that nested brackets matching solution already discussed [here](https://msdn.microsoft.com/en-us/library/bs2twtah.aspx#balancing_group_definition) and [here](http://stackoverflow.com/questions/19693622/how-to-get-text-between-nested-parentheses) could help me. The nested brackets matching is already possible, I wanted to look for a way to limit it by labels like only nesting between brackets having an NP at the start and within it no matter how many nested brackets but stopping when closing brackets occurred. Looks like no counting is possible. – Shakir Apr 19 '17 at 11:13
  • 1
    .NET regex *can* "count", `(?)` pushes the value on to the stack (yes, .NET regex supports the *stack*) and `(?<-l>)` will pop the value from the stack. Then all you need is to check if the stack is empty with a conditional construct. Computer science clases have got nothing to do with .NET regex ability to match nested structures. – Wiktor Stribiżew Apr 19 '17 at 11:27
  • Thanks. I will have to learn it. I am unfortunately only an amateur programmer with a linguistics background (no computer science classes ever). – Shakir Apr 19 '17 at 11:38
  • Matching nested structures isn't the issue, so long as you know how deeply they are nested. The problem is in matching an arbitrary nesting where you do not know at the beginning how deeply the brackets/parens are nested. The former you can do with regex. The latter you cannot. – Ray Fischer Apr 19 '17 at 21:34