8

I'm trying to parse string in the following format (EBNF, I hope this is right) in PHP:

<exp>      ::= <base>[{<modifier>["!"]"("<exp>")"}]
<base>     ::= <role>[{<modifier><role>}]
<modifier> ::= "&" | "|"
<role>     ::= ["!"]<str>[","<str>]

Where <str> is any string that would pass [a-zA-Z0-9\-]+

The following are example of patterns that would have to be parsed:

token1
token1&token2
token1|(token2&!token3)
(token1&token2)|(token3&(token4|(!token5,12&token6)))
!(token1&token2|(token3&!token4))|token5,12

I am trying to write a RegEx pattern that would always give me four groups:

  1. The left-most <expression>. From the above example this would be:
    • token1
    • token1
    • token1
    • token1&token2
    • token1&token2|(token3&!token4)
  2. If ["!"] was present. I.e.
    • null
    • null
    • null
    • null
    • !
  3. The <modifier> for the next <expression> (if any). This would be:
    • null
    • &
    • |
    • |
    • |
  4. The remaining of the pattern.
    • null
    • token2
    • token2&!token3
    • token3&(token4|(!token5,12&token6))
    • token5,12

I can parse this provided that the first expression doesn't contain any <modifier>s.

^\(?(!?)([a-zA-Z0-9\-]+)\)?([&|]?)(.*)$

I am stuck at this point. I have tried using lookarounds, however I can't figure out how to ensure that the group is captured when all brackets are balanced. Is this achievable with RegEx or do I need to write code using loops etc. to do this?

Bart Platak
  • 4,387
  • 5
  • 27
  • 47
  • +1 also, nice job, will be following this for responses – Ionut Flavius Pogacian Jul 07 '12 at 20:49
  • You can almost literally translate your description into a PCRE `?(DEFINE)` block. However, PCRE only allows matching, not parsing. You won't get the split up result list as you'd want. (Unless you also use a recursive `preg_replace_callback` to collect all tokens.) – mario Jul 07 '12 at 21:05

1 Answers1

1

As far as I know, it is impossible.

You have a context-free grammar (EBNF is for this type of grammars - Type-2 grammars), which cannot be parsed with regular expressions (which are for regular grammars - Type-3 grammars).

http://en.wikipedia.org/wiki/Chomsky_hierarchy

As an example of the thing you cannot handle here: number of opening parantheses - you can only write one regexp for each number of these (but there can be infinite, right?), otherwise there is no way to tell if the number of matching closing parantheses is the same. There is no way to count how many chars mathed by the specific part of regexp with quantifiers (+, *, etc.)

scriptin
  • 3,060
  • 1
  • 24
  • 33
  • PHP's regexes do allow recursion; but I would recommend against it. – Evert Jul 07 '12 at 21:26
  • @Evert Yes, but this doesn't help. Example: `"$m=array(); preg_match('/ < ( (?>[a-z]+) | (?R) )* > /x', '', $m); var_dump($m);"` - this works because regexp doesn't match the whole string - no `^` and `$`. BUT, if we add `^` and `$`, it would NOT work even for `'>'` as input string, because `(?R)` now contains `^` and `$`. I hope this explanation is comprehensible. – scriptin Jul 07 '12 at 22:18
  • I have looked into the recursion but it seems like a really messy way to do it. Instead I am now writing my own parser to deal with this.. Thanks :) – Bart Platak Jul 07 '12 at 23:00
  • 4
    Not correct. Just because they are called regular expressions, [does not mean](http://stackoverflow.com/questions/2255403/why-is-recursive-regex-not-regex) they are limited to regular languages. And FYI you can also invoke a particular subexpression recursively `(?1)` thus excluding `^` and `$`. – mario Jul 08 '12 at 00:32