2

TL;DR: Is there a way to specify a conditional so that an opening element MUST match its paired closing element?

Example is located on regex101.com.

=====

Balancing elements in regex is typically handled through recursion. This means that nested {...{...{...}...}...} can be located.

Also, PCRE allows the (?(DEFINE)...) construct, which lets you define various patterns without actually starting the match.

In the regular expression

# Define the opening and closing elements before the recursion occurs
(?(DEFINE)
  (?<open_curly>\{)
  (?<close_curly>\})
  # ... other definitions here ...

  (?<open>\g'open_curly')
  (?<close>\g'close_curly')
)

# Match the opening element
(\g'open'
  (?>
    # For recursion, don't match either the opening or closing element
    (?!\g'open'|\g'close')(?s:.)
  |
    # Recurse this captured pattern
    (?-1)
  )*

# Match the closing element
\g'close')

the elements are the { and } characters, and can match against patterns such as

{{{}}}
{ test1 { test2 { test3 { test4 } } } }

I want to include other open/close elements, such as [ and ], or --[ and --], so include those in the (?(DEFINE)):

(?<open_square>\[)
(?<close_square>\])
(?P<open_pascal>(?i:\bbegin\b))
(?P<close_pascal>(?i:\bend\b))
(?P<open_lua>--\[)
(?P<close_lua>--\])
(?<open>\g'open_curly'|\g'open_square'|\g'open_pascal'|\g'open_lua')
(?<close>\g'close_curly'|\g'close_square'|\g'close_pascal'|\g'close_lua')

What this DOESN'T do correctly is to pair the opening element with its closing element, allowing --[ to group with }, which is not desirable.

Is there a way to create open/close pairs in a regex like this?

OnlineCop
  • 4,019
  • 23
  • 35
  • In general case, regex's can't detect balanced elements, cause regex is a state machine, but you need a stack machine. – Mark Shevchenko Jan 07 '15 at 20:17
  • @Mark pattern matching libraries let you use recursion (which requires a stack) to allow matching nested constructs. See my answer. Some implementations (.NET) use balancing groups, which is also a stack. So you sure can match nested constructs with these patterns - only you can no longer call them *regular*. – Lucas Trzesniewski Jan 07 '15 at 20:34
  • Thanks. It's a new info for me. – Mark Shevchenko Jan 07 '15 at 20:42

2 Answers2

2

You got it wrong. You can achieve it with recursion like this:

(?(DEFINE)
  (?<curly>  \{        \g<content>*? \}      )
  (?<square> \[        \g<content>*? \]      )
  (?<pascal> \bbegin\b \g<content>*? \bend\b )
  (?<lua>    --\[      \g<content>*? --\]    )

  (?<nested> \g<curly> | \g<square> | \g<pascal> | \g<lua> )

  (?<content>
    # Math non-recursive content (atomically)
    (?: (?! [{}\[\]] | \bbegin\b | \bend\b | --[\[\]] ) . )++
    # Or recurse
    | \g<nested>
  )
)

\g<nested>

Demo

The trick here is to leverage the regex engine stack to keep a sort of memory of the matched opening symbol, in order to require the matching closing symbol when the regex engine unwinds the stack as it leaves the recursion. You also need to make sure the catch-all . doesn't consume a group starting/ending symbol.


Simpler cases are more readable:

(?(DEFINE)
  (?<curly>  \{ \g<content>*? \} )
  (?<square> \[ \g<content>*? \] )

  (?<nested> \g<curly> | \g<square> )

  (?<content> [^{}\[\]]++ | \g<nested> )
)

\g<nested>

Demo

Notice the atomic group. It prevents excessive backtracking. Also, this regex instantly consumes anything that can't be recursive without trying to recurse into it first.


Also, to answer your question, conditionals are of no use here, since you need a stack to keep track of the ending symbol you need to match.

Some implementations (PCRE) let you use a stack by means of recursive patterns, while others (.NET) expose a stack through balancing groups. But when you have several kinds of balanced constructs, recursion is the clear winner.

Community
  • 1
  • 1
Lucas Trzesniewski
  • 50,214
  • 11
  • 107
  • 158
  • I'm loving the fact that there is no explicit recursion call, but it gets recursed based on the groups referencing each other like that. I agree, the single-character elements *are* more optimized since they don't need to backtrack. I'm curious as to why the `(?...)` line needs to re-specify the elements that are to be negated, instead of reusing the `\g`-style construct. – OnlineCop Jan 07 '15 at 20:52
  • @OnlineCop The thing is... `\g` **is** an explicit recursion, only in this case it's a mutual recursion :) this is why you can't re-use `\g` to negate anything inside `content` - it would recurse. Also, I optimized the first case at the cost of readability. – Lucas Trzesniewski Jan 07 '15 at 21:22
1

I would say that there is no use polluting the logic with a bunch of named groups and
unnecessary recursion that is unpredictable.

There are three main parts to maintaining a proper recursion (listed below).
To do it right, you must parse everything, so you have to take into account
content and unbalanced errors.

The engine won't let you capture in detail, anything beyond the first level.
That means its easier to maintain a TOP that you can pull info from and a CORE
that you can't maintain, but does the recursion. A few subtle differences but can be seen
in the below examples.

Anytime a core is recursed, it is immediatly surrounded by an individual
set (pair) of unique delimiters. This is to unwind the stack properly.
This process can't be factored out (generalized).

update
Usually this regex is called within a recursive function, passing the CORE to it each time.
Example pseudo-code:

bool bIsOk = true;
bool RecurseCore( string core )
{  
     while( regex_search ( core, regex, match ) )
     {
          if ( match[1].success ) { print 'content' }
          else
          if ( match[2].success ) { print 'square'; bIsOk = RecurseCore( match[2].value ) }
          else
          if ( match[3].success ) { print 'curly'; bIsOk = RecurseCore( match[3].value ) }
          else
          if ( match[4].success ) { print 'pascal'; bIsOk = RecurseCore( match[4].value )  }
          else
          if ( match[5].success ) { print 'lua'; bIsOk = RecurseCore( match[5].value )   }
          else
          if ( match[6].success ) { print 'error'; bIsOk = false } // error
          if ( bIsOk == false ) { break }
     }
     return bIsOk;
 }

Regex:

 # //////////////////////////////////////////////////////
 # // The General Guide to 3-Part Recursive Parsing
 # // ----------------------------------------------
 # // Part 1. CONTENT
 # // Part 2. CORE
 # // Part 3. ERRORS

 (?si)                      # Dot all, no case

 (?:
      (                          # (1), Take off CONTENT
           (?&content) 
      )
   |                           # OR
      \[                         # Square's delimiter
      (                          # (2), square CORE
           (?= . )
           (?&core) 
        |  
      )
      \]                         # End-Delimiter
   |                           # OR
      \{                         # Curly's delimiter
      (                          # (3), curly CORE
           (?= . )
           (?&core) 
        |  
      )
      \}                         # End-Delimiter
   |                           # OR
      \b begin \b                # Pascal's delimiter
      (                          # (4), pascal CORE
           (?= . )
           (?&core) 
        |  
      )
      \b end \b                  # End-Delimiter
   |                           # OR  
      --\[                       # Lua's delimiter
      (                          # (5), lua CORE
           (?= . )
           (?&core) 
        |  
      )
      --\]                       # End-Delimiter
   |                           # OR
      (                          # (6), Take off Unbalanced (delimeter) ERRORS
           \b 
           (?: begin | end )
           \b 
        |  -- [\[\]] 
        |  [\[\]{}] 
      )
 )

 # ///////////////////////
 # // Subroutines
 # // ---------------

 (?(DEFINE)

      # core
      (?<core>
           (?>
                (?&content) 
             |  
                \[                         
                (?:                        # Square delimiter
                     (?= . )                    # recurse core
                     (?&core)                   
                  |  
                )
                \]
             |                           # OR
                \{
                (?:                        # Curly delimiter
                     (?= . )                    # recurse core 
                     (?&core) 
                  |  
                )
                \}      
             |                           # OR
                \b begin \b 
                (?:                        # Pascal delimiter
                     (?= . )                    # recurse core 
                     (?&core) 
                  |  
                )
                \b end \b     
             |                           # OR
                --\[
                (?:                        # Lua delimiter
                     (?= . )                    # recurse core 
                     (?&core) 
                  |  
                )
                --\]      
           )+
      )

      # content 
      (?<content>
           (?>
                (?!
                     \b 
                     (?: begin | end )
                     \b 
                  |  -- [\[\]] 
                  |  [\[\]{}] 
                )
                . 
           )+
      )

 )