Background: I'm fiddling around with an idea for simple templating that only provides if/for/render, as opposed to using NVelocity, Razor, or anything else, to see how feasible it is and whether it makes sense to use in my personal project.

I've written a regular expression:

(?:(?:(?<open>\[if (?<if>[a-zA-Z0-9\.]+)\])(?<content>[^\[]*))+(?:[^\[]*(?<close-open>\[end if\]))+)+(?(open)(?!))

And when used with the sample text:

<div>
[if variable3]{{variable3}}[end if]
</div>

<div>
[if variable1]

    {{variable1}}

    [if variable2]
        <br>
        {{variable2}}
    [end if]

[end if]
</div>

It's working as expected. I get 2 matches, and if the 2nd match is valid I can parse the inner capture.

The problem appears when I have multiple nested blocks inside one outer block. So given:

<div>
[if variable3]{{variable3}}[end if]
</div>

<div>
[if variable1]

    {{variable1}}

    [if variable2]
        <br>
        {{variable2}}
    [end if]

    [if variable4]
        <br>
        {{variable4}}
    [end if]

    [if variable5]
        <br>
        {{variable5}}
    [end if]

[end if]
</div>

What I end up with is a correct first match, but for the second block I get the 3 inner blocks as individual matches instead of the outer block.

If I expand the capture to ignore \[ for the inner content, it causes the first and second match to combine into a single match. :(

Does anyone know how to fix this? (If you have a better idea of how to do this templating, I'd be keen to hear it in the comments.)

Phill
  • I'd recommend looking into an XML parser. Attempting to parse HTML with regex is [generally discouraged](https://stackoverflow.com/a/1732454/4416750). – Lews Therin Jan 23 '19 at 14:19
  • @LewsTherin But he's not parsing HTML? – Kenneth K. Jan 23 '19 at 14:24
  • 2
    @LewsTherin I'm not parsing the HTML tho, I'm parsing my own syntax out of it. – Phill Jan 23 '19 at 14:24
  • How is extracting data from an HTML document not considered parsing? You need to be able to distinguish between the data you are interested in and the HTML tags. Is my understanding of parsing wrong? – Lews Therin Jan 23 '19 at 14:28
  • My "tags" are `[if...]` `[end if]` `[for...]` and `[end for]` that's my total list. I could remove the HTML from my example above if that would make things clearer. – Phill Jan 23 '19 at 14:36
  • 1
    You may match all blocks correctly with `(?s)\[if\s+(?[^][]+)](?>(?:(?!\[if\s|\[end\ if]).)+|(?)\[end\ if]|(?)\[if\s+(?[^][]+)])*(?(open)(?!))\[end\ if]`, but I have doubts as for capturing the contents. – Wiktor Stribiżew Jan 23 '19 at 14:55
  • @WiktorStribiżew good step forward, that captures the block scope correctly, just need to figure out contents in the middle. – Phill Jan 23 '19 at 15:15
  • I think it is not possible. – Wiktor Stribiżew Jan 23 '19 at 15:17
  • @WiktorStribiżew well the capture is the whole group, it at least gives me the `variable` named capture, so I can do my check to see if it should render or not, then just re-parse the whole capture text and trim the start/end. Just trying to do that now. – Phill Jan 23 '19 at 15:33
  • @WiktorStribiżew wooohoo, using your code, I trim the ends and re-parse the match if the variable exists, all my existing tests pass and the ones that were failing before now pass too. Thanks a lot :D – Phill Jan 23 '19 at 16:13

1 Answer

You may use

@"(?s)\[if\s+(?<if>[^][]+)](?<fullBody>(?>(?:(?!\[if\s|\[end\ if]).)+|(?<-open>)\[end\ if]|(?<open>)\[if\s+(?<if>[^][]+)])*(?(open)(?!)))\[end\ if]"

See the regex demo.

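Details (note that you may use the commented version as-is inside C# code thanks to the x modifier, which makes the regex engine ignore unescaped whitespace and # comments):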

@"(?sx)               # Singleline and IgnorePatternWhitespace flags on
  \[if\s+             # "[if" and then 1+ whitespaces
    (?<if>[^][]+)     # Group "if": one or more chars other than "[" and "]"
  ]                   # a "]" char
   (?<fullBody>       # Group "fullBody" containing all nested if blocks
     (?>              # Start of an atomic group
       (?:(?!\[if\s|\[end\ if]).)+|    # any char, 1+ occurrences, that does not start the "[if " or "[end if]" substring, or...
       (?<-open>)\[end\ if]|           # "[end if]" substring and an item is popped from Group "open", or
       (?<open>)\[if\s+(?<if>[^][]+)]  # Group "open": "[if", 1+ whitespaces, Group "if": 1+ chars other than "[" and "]", and then a "]" char
     )*               # repeat atomic group patterns 0 or more times
     (?(open)(?!))    # A conditional: if Group "open" has any items on its stack, fail and backtrack
   )                  # End of fullBody group
  \[end\ if]"         # "[end if]" substring
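
As a rough illustration of how this pattern might drive the `if` handling, here is a minimal sketch. It assumes the set of "truthy" variable names is supplied by the caller; `IfTemplate`, `Render`, and `truthy` are hypothetical names, and `{{...}}` placeholders are deliberately left untouched. Since the "if" group is also captured by nested blocks, `Groups["if"].Value` would return the last (inner) capture, so the sketch reads `Captures[0]` to get the outer block's variable.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: resolves [if x]...[end if] blocks recursively.
public static class IfTemplate
{
    // Same pattern as above, including the "fullBody" group.
    private static readonly Regex IfBlock = new Regex(
        @"(?s)\[if\s+(?<if>[^][]+)](?<fullBody>(?>(?:(?!\[if\s|\[end\ if]).)+|(?<-open>)\[end\ if]|(?<open>)\[if\s+(?<if>[^][]+)])*(?(open)(?!)))\[end\ if]");

    // Keeps the body of every block whose variable is in `truthy`
    // and drops the others, descending into nested blocks.
    public static string Render(string template, ISet<string> truthy)
    {
        return IfBlock.Replace(template, m =>
        {
            // First capture of "if" belongs to the outermost block of this match.
            string name = m.Groups["if"].Captures[0].Value.Trim();
            return truthy.Contains(name)
                ? Render(m.Groups["fullBody"].Value, truthy)
                : string.Empty;
        });
    }
}
```

Calling `IfTemplate.Render(text, new HashSet<string> { "variable1", "variable4" })` on the second sample from the question should drop the `variable3`, `variable2`, and `variable5` blocks and keep the bodies of the other two. Recursing on `fullBody` avoids having to trim the `[if ...]` and `[end if]` delimiters by hand.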

If you do not care which block an if block is nested in, you can get a flat list of all if blocks using a variation of this regex:

var pattern = @"(?s)(?=(?<ifBlock>\[if\s+(?<if>[^][]+)](?<fullBody>(?>(?:(?!\[if\s|\[end\ if]).)+|(?<-open>)\[end\ if]|(?<open>)\[if\s+(?<if>[^][]+)])*(?(open)(?!)))\[end\ if]))";

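The pattern above is just wrapped with another named capturing group and placed inside a positive lookahead. Because the lookahead consumes no text, the regex engine retries at every position, so nested blocks are found as well as outer ones. While the match value will always be empty, the groups will hold all the values you may need.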
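
A small sketch of reading the groups from the lookahead variant follows. `IfBlockLister` and `FindAll` are hypothetical names chosen for illustration. As before, `Captures[0]` is used because the "if" group can hold several captures when a block contains nested blocks.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: lists every if block, nested or not, in document order.
public static class IfBlockLister
{
    private static readonly Regex AnyIfBlock = new Regex(
        @"(?s)(?=(?<ifBlock>\[if\s+(?<if>[^][]+)](?<fullBody>(?>(?:(?!\[if\s|\[end\ if]).)+|(?<-open>)\[end\ if]|(?<open>)\[if\s+(?<if>[^][]+)])*(?(open)(?!)))\[end\ if]))");

    public static List<(string Name, string Body)> FindAll(string template)
    {
        var blocks = new List<(string Name, string Body)>();
        foreach (Match m in AnyIfBlock.Matches(template))
        {
            // The match itself is empty; the data lives in the groups.
            blocks.Add((m.Groups["if"].Captures[0].Value, m.Groups["fullBody"].Value));
        }
        return blocks;
    }
}
```

Each entry pairs a block's variable name with its full body, so an outer block's body still contains the text of its inner blocks.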

Wiktor Stribiżew
  • You sir, are a literal Regex God. I REALLY appreciate you even adding the detailed break down to explain it all. Super helpful to learn from. – Phill Jan 24 '19 at 04:05