1

I'm trying to write the correct regular expression in PHP to parse a string (written by some user) to build a request. It can be as complex as :

name = 'benjo' and (surname = 'benny' or surname = 'bennie') or age = 4

Later I'll parse the string to build mySQL queries. For now, I'm just trying to find the correct regular expression to parse this string into an array that could look like :

$result = array(
0 => name = 'benjo',
1 => and
2 => array(
    0 => surname = 'benny',
    1 => or,
    2 => surname = 'bennie',
    ),
3 => age = 4
);

I've thought about using recursive functions, and my regular expression for now is :

"#\((([^()]+|(?R))*)\)|(ou[^()])|(et[^()])#",

which of course does not work.

I'll be glad if someone could help, I'm getting kinda stuck here ! :) Tks, Romain

LET'S CHANGE THE CHALLENGE ! :) OK, now let's make it a bit more simple. What would it take with a regular expression and adding the constraint that we stay on "level one" !! No nested parenthesis, just one level, but still as many AND/ORs... Would that change anything in favor or REGEXPs ? (I really would like to avoid writing my mini parser although that sounds really interesting...

Romain Bruckert
  • 2,546
  • 31
  • 50
  • 1
    what's the source of the original string? –  May 27 '12 at 08:15
  • It is an input ! (ps: i don't want to check input for bad code or injection at this stage!) ;) – Romain Bruckert May 27 '12 at 08:23
  • 3
    You're probably better off making a little mini-parser of your own rather than using regular expressions. – Corbin May 27 '12 at 08:24
  • Ok Corbin, but how can I build a mini parser without using regexp ?? I need to capture recursively what' both IN the parentesis, and the 'and' and 'or's ... – Romain Bruckert May 27 '12 at 08:25
  • 3
    Search for 'php recursive descent parser' - http://stackoverflow.com/questions/2245962/is-there-an-alternative-for-flex-bison-that-is-usable-on-8-bit-embedded-systems/2336769#2336769 – Ja͢ck May 27 '12 at 08:27
  • @Romain Bruckert: it can be done using any [FSM](http://en.wikipedia.org/wiki/Finite-state_machine) implementation or (more difficult but more robust) writing [lexer](http://en.wikipedia.org/wiki/Lexical_analysis) – zerkms May 27 '12 at 08:28
  • @zerkms With an FSM you’re back to regular expressions. This is inadequate here. – Konrad Rudolph May 27 '12 at 08:30
  • @Konrad Rudolph: FSM allows you to choose how to parse. And even with regexes - FSM is easier (from maintaining point of view) solution than just plain regexes, just because it is FSM – zerkms May 27 '12 at 08:31
  • @zerkms I don’t know what you mean. FSMs are perfectly expressed by regular expressions. Why would manually writing an FSM here help? How is that easier to maintain (it isn’t – changes in specs incur much larger changes in code). I think you are confusing them with a recursive descent parser. – Konrad Rudolph May 27 '12 at 08:33
  • @Konrad Rudolph: just because regular expressions are for regular languages, while FSMs can parse literally anything? – zerkms May 27 '12 at 08:34
  • @zerkms No, you are simply confusing things. FSM and regex are the same thing. – Konrad Rudolph May 27 '12 at 08:39
  • Wooo thanks to you all, now can't make my choice. I think I'll use some fixed pattern. the user will click on a radio button AND/OR to add his criterias and i'll force other inputs to build the request ! Isn't that the best solution ? ;) – Romain Bruckert May 27 '12 at 08:40
  • @Konrad Rudolph: regex is one of the possible datasources for FSM. I cannot get how "screw" can be the same thing as "car". PS: found this http://qntm.org/algo --- now I understand what you mean, but still don't agree that they are **the same thing**. They **may** express the same parse logic, but it doesn't make regex equal to FSM – zerkms May 27 '12 at 08:41
  • 2
    @zerkms Mathematically speaking, they are the same, just two different *representations*. I used the “same thing” shortcut since you didn’t seem to understand that FSMs aren’t more powerful than regular expressions. – Konrad Rudolph May 27 '12 at 08:44
  • @Konrad Rudolph: ok, seems like I just don't have enough math background for this discussion, sorry ;-) – zerkms May 27 '12 at 08:45

1 Answers1

4

Theoretical regular expression is not powerful enough to do parentheses matching. Theoretical regular expression can only take care of left recursion/right recursion rules. Middle recursion rules is cannot be expressed with regular expression (e.g. <exp> -> "(" <exp> ")").

Regex in programming languages, however, implements features which allow regex to exceed the power of regular grammar. For example, backreference in regex allows one to write a regex which matches a non context-free languages. However, even with backreference, it's still not possible to balance parentheses with regex.

As PCRE library supports recursive regex via subroutine call feature, it is technically possible to parse such an expression with regex. However, unless you can write the regex yourself, which means that you understand what you are doing and can modify the regex to suit your needs, you should just write your own parser. Otherwise, you will end up with an unmaintainable mess.

(?(DEFINE)
  (?<string>'[^']++')
  (?<int>\b\d+\b)
  (?<sp>\s*)
  (?<key>\b\w+\b)
  (?<value>(?&string)|(?&int))
  (?<exp>(?&key) (?&sp) = (?&sp) (?&value))
  (?<logic>\b (?:and|or) \b)
  (?<main>
    (?<token> \( (?&sp) (?&main) (?&sp) \) | (?&exp) )
    (?:
      (?&sp) (?&logic) (?&sp)
      (?&token) 
    )*
  )
)
(?:
  ^ (?&sp) (?= (?&main) (?&sp) $ )
  |
  (?!^) \G
  (?&sp) (?&logic) (?&sp)
)
(?:
  \( (?&sp) (?<m_main>(?&main)) (?&sp) \)
  |
  (?<m_key>(?&key)) (?&sp) = (?&sp) (?<m_value>(?&value))
)

Demo on regex101

The regex above should be use with preg_match_all, and placed between delimiter with x flag (free spacing mode): /.../x.

For each match:

  • If m_main capturing group has content, put the content through another round of matching.
  • Otherwise, get the key and value in m_key and m_value capturing group.

Explanation

The (?(DEFINE)...) block allows you to define named capturing groups for use in subroutine calls separately from the main pattern.

(?(DEFINE)
  (?<string>'[^']++')  # String literal
  (?<int>\b\d+\b)      # Integer
  (?<sp>\s*)           # Whitespaces between tokens
  (?<key>\b\w+\b)      # Field name
  (?<value>(?&string)|(?&int)) # Field value
  (?<exp>(?&key) (?&sp) = (?&sp) (?&value)) # Simple expression
  (?<logic>\b (?:and|or) \b) # Logical operators
  (?<main>             # <token> ( <logic> <token> )*
    # A token can contain a simple expression, or open a parentheses (...)
    # When we open a parentheses, we recurse into the main pattern again
    (?<token> \( (?&sp) (?&main) (?&sp) \) | (?&exp) )
    (?:
      (?&sp) (?&logic) (?&sp)
      (?&token) 
    )*
  )
)

The rest of the pattern is based on this technique to match all <token>s in <token> ( <logic> <token> )* with global matching operation.

The last part of the regex, while can be written as (?&token), is expanded to match the field name and value in the simple expressions.

(?:
  \( (?&sp) (?<m_main>(?&main)) (?&sp) \)
  |
  (?<m_key>(?&key)) (?&sp) = (?&sp) (?<m_value>(?&value))
)
Community
  • 1
  • 1
nhahtdh
  • 55,989
  • 15
  • 126
  • 162
  • 3
    Actually, PHP’s “regular” expressions *can* do parenthesis matching, and I think it might even parse OP’s expression but the result would be an arcane, unmaintainable mess. – Konrad Rudolph May 27 '12 at 08:30
  • Just curious, but can you show how you would check whether the parentheses are matched correctly in this case: `([()[()][[((()))]]])` with PHP regex? – nhahtdh May 27 '12 at 08:33
  • 1
    @nhahtdh: there is even an example of doing that in PHP's manual (hint: recursive regexes) – zerkms May 27 '12 at 08:35
  • Yes i got that from the manual, but i don't think it works for nested parentesis. – Romain Bruckert May 27 '12 at 08:38
  • @Romain Bruckert: it does - that is what *recursive* about – zerkms May 27 '12 at 08:46