Theoretical regular expression is not powerful enough to do parentheses matching. Theoretical regular expression can only take care of left recursion/right recursion rules. Middle recursion rules is cannot be expressed with regular expression (e.g. <exp> -> "(" <exp> ")"
).
Regex in programming languages, however, implements features which allow regex to exceed the power of regular grammar. For example, backreference in regex allows one to write a regex which matches a non context-free languages. However, even with backreference, it's still not possible to balance parentheses with regex.
As PCRE library supports recursive regex via subroutine call feature, it is technically possible to parse such an expression with regex. However, unless you can write the regex yourself, which means that you understand what you are doing and can modify the regex to suit your needs, you should just write your own parser. Otherwise, you will end up with an unmaintainable mess.
(?(DEFINE)
(?<string>'[^']++')
(?<int>\b\d+\b)
(?<sp>\s*)
(?<key>\b\w+\b)
(?<value>(?&string)|(?&int))
(?<exp>(?&key) (?&sp) = (?&sp) (?&value))
(?<logic>\b (?:and|or) \b)
(?<main>
(?<token> \( (?&sp) (?&main) (?&sp) \) | (?&exp) )
(?:
(?&sp) (?&logic) (?&sp)
(?&token)
)*
)
)
(?:
^ (?&sp) (?= (?&main) (?&sp) $ )
|
(?!^) \G
(?&sp) (?&logic) (?&sp)
)
(?:
\( (?&sp) (?<m_main>(?&main)) (?&sp) \)
|
(?<m_key>(?&key)) (?&sp) = (?&sp) (?<m_value>(?&value))
)
Demo on regex101
The regex above should be use with preg_match_all
, and placed between delimiter with x flag (free spacing mode): /.../x
.
For each match:
- If
m_main
capturing group has content, put the content through another round of matching.
- Otherwise, get the key and value in
m_key
and m_value
capturing group.
Explanation
The (?(DEFINE)...)
block allows you to define named capturing groups for use in subroutine calls separately from the main pattern.
(?(DEFINE)
(?<string>'[^']++') # String literal
(?<int>\b\d+\b) # Integer
(?<sp>\s*) # Whitespaces between tokens
(?<key>\b\w+\b) # Field name
(?<value>(?&string)|(?&int)) # Field value
(?<exp>(?&key) (?&sp) = (?&sp) (?&value)) # Simple expression
(?<logic>\b (?:and|or) \b) # Logical operators
(?<main> # <token> ( <logic> <token> )*
# A token can contain a simple expression, or open a parentheses (...)
# When we open a parentheses, we recurse into the main pattern again
(?<token> \( (?&sp) (?&main) (?&sp) \) | (?&exp) )
(?:
(?&sp) (?&logic) (?&sp)
(?&token)
)*
)
)
The rest of the pattern is based on this technique to match all <token>
s in <token> ( <logic> <token> )*
with global matching operation.
The last part of the regex, while can be written as (?&token)
, is expanded to match the field name and value in the simple expressions.
(?:
\( (?&sp) (?<m_main>(?&main)) (?&sp) \)
|
(?<m_key>(?&key)) (?&sp) = (?&sp) (?<m_value>(?&value))
)