1

I'm trying to use recursive regex to match bash-like variable expansions. Basically, I should be able to match strings like the following:

${FOO=BAR}
${FOO=${BAR=BAZ}}

But also handle inputs like this:

${FOO=\${BAR=BAZ}}
${FOO={${BAR=BAZ}}

In the first case, it should match up to, but excluding the final }, and the second case should match completely. This is because I'm trying to match the two-character opening ${ with the one-character closing }. Both the opening and closing should be able to be escaped.

I've gotten as far as the PCRE regex \$\{(?:[^{}]|(?R))*\}. But this doesn't handle escaping correctly. If I amend it to be (?:^|[^\\])(?:\\\\)*(\$\{(?:[^{}]|(?R))*\}), then only outermost escapes match correctly.

Can this be done with regex, or am I better off just writing a pyparsing parser?

Alex Reinking
  • 16,724
  • 5
  • 52
  • 86
  • What you are trying to match is a bit unclear, in particular the two last examples. Why the last example does not have balanced curly brackets? – Casimir et Hippolyte Oct 26 '14 at 21:21
  • The first example escapes the two-character opening sequence: I'm trying to match `${` with `}`, not just `{` and `}`. In the first case the whole match should be `${FOO=\${BAR=BAZ}`, and in the second, the whole string should match because the second `{` isn't actually opening anything. – Alex Reinking Oct 26 '14 at 21:23
  • I edited the question to make this clearer. – Alex Reinking Oct 26 '14 at 21:24
  • So you need to allow a single curly bracket (not preceded by a `$`) as content. – Casimir et Hippolyte Oct 26 '14 at 21:25
  • Not to be a nitpick, but is it possible you could see double `$$` or a stand alone `$` anywhere ? –  Oct 26 '14 at 22:07
  • `in the second, the whole string should match because the second { isn't actually opening anything` - Usually in any language there are several different balanced delimiter constructs. Usually these metacharacters don't stand alone, they are either escaped or quoted. –  Oct 26 '14 at 22:11

1 Answers1

3

You can try this pattern:

(?s)\\.(*SKIP)(*F)|(?s)(\${(?>[^$}\\]+|\\.|(?1))*})

online example

details:

(?s)
\\.               # an escaped character
(*SKIP)           # skip the matched content if the pattern fails later
(*F)              # force the pattern to fail
|
(?s)
(
    \${
    (?>           # open a atomic group
        [^$}\\]+  # all that is not a backslash, a $ or a }
      |           # OR
        \\.       # an escaped character
      |           # OR
        (?1)      # recurse to group 1
    )*            # repeat the atomic group zero or more times
    }               
)

The main idea is to avoid that an escaped dollar followed by an opening curly bracket is considered as an opening tag.

Notes: instead of using inline modifiers (?s) for each branchs, you can remove them and use the global modifier s.

To be fully rigorous, you can allow a $ not followed by an opening curly bracket in the content by adding the alternative \$(?!{) in the atomic group. (before the recursion)

About (*SKIP) and (*FAIL):

(*SKIP) and (*FAIL) are called backtracking control verbs.

When a pattern fails the default behaviour of an NFA regex engine is to use the backtracking mechanism. These verbs allow a control of this mechanism.

To be more concrete, the goal of the combo subpattern(*SKIP)(*FAIL) is to exclude the content matched by the subpattern from the match result, and to forbid the regex engine to try anything else with the matched substring. The substring is skipped.

You can see a full explanation here.

About atomic grouping:

An atomic group is a non-capturing group. The only difference is once the closing parenthesis reached, the regex engine is no more allowed to backtrack inside characters matched between parenthesis. It can only go to the position before the group. The atomic group makes the matched substring indivisible (atomic).

Here, the atomic group prevents catastrophic backtracking that may occurs with this kind of constructions (?:A+|B)+ if the pattern fails later.

Community
  • 1
  • 1
Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125
  • Wow, that is neat! What, exactly, is an atomic group? I also didn't know about `(*SKIP)` or `(*FAIL)`. Can you explain this in just a little more detail? – Alex Reinking Oct 26 '14 at 21:40