How to match some nested structure with regex?

Question

For example, I have a string like this:

{% a %}
    {% b %}
    {% end %}
{% end %}

I want to get the content between {% a %} and {% end %}, which is {% b %} {% end %}.
I used to use {% \S+ %}(.*){% end %} for this. But when I add a second block, c, after it:

 {% a %}
        {% b %}
        {% end %}
    {% end %}
{% c %}
{% end %}

It no longer works. How can I do this with a regular expression?

Is it a nested structure of arbitrary depth? If so, that is not a regular language. — eldarerathis, Apr 07 '11 at 15:46
Please don't try. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454 — casablanca, Apr 07 '11 at 15:46
You will probably have a much easier time matching the individual elements with a regular expression and using a stack to match the opening / closing blocks. — GWW, Apr 07 '11 at 15:47
@eldarerathis: That is a red herring, please stop repeating it. **IT DOES NOT APPLY** because it is absolutely trivial to match nested structures using modern patterns. — tchrist, Apr 07 '11 at 16:55
@casablanca: Please stop posting that idiotic and irrelevant link. It does not apply, and is wrong anyway. — tchrist, Apr 07 '11 at 17:18
@eldarerathis: Good thing that PHP regular expressions are not [REGULAR](http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html#comment_40 "Modern regexes are NOT REGULAR, and haven't been for a long, long time!")! — ridgerunner, Apr 07 '11 at 18:54
@tchrist: That's why it was a comment and not an answer. I personally think that writing a parser is simpler and more understandable than a regex in such situations. — casablanca, Apr 07 '11 at 20:37
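
GWW's suggestion in the comments above (match the individual tags with a flat pattern and pair them up with a stack) can be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name top_level_blocks() and the shape of its return value are made up here and do not come from the thread.

```php
<?php
// Hypothetical helper (not from the thread): returns each top-level block
// as ['name' => tag name, 'inner' => text between the opening tag and its end tag].
function top_level_blocks(string $text): array
{
    // One flat pattern per tag; nesting is tracked by the stack, not by the regex.
    preg_match_all('/\{%\s+(\w+)\s+%\}/', $text, $tags,
                   PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $stack  = [];  // currently open tags: [name, byte offset just past the tag]
    $blocks = [];
    foreach ($tags as $tag) {
        [$whole, $start] = $tag[0];
        $name = $tag[1][0];
        if ($name !== 'end') {
            $stack[] = [$name, $start + strlen($whole)];
            continue;
        }
        if (!$stack) {
            throw new RuntimeException('Unmatched end tag at offset ' . $start);
        }
        [$openName, $innerStart] = array_pop($stack);
        if (!$stack) {  // the stack just emptied, so a top-level block closed
            $blocks[] = [
                'name'  => $openName,
                'inner' => substr($text, $innerStart, $start - $innerStart),
            ];
        }
    }
    if ($stack) {
        throw new RuntimeException('Unclosed block: ' . $stack[0][0]);
    }
    return $blocks;
}

// Usage with the data from the question.
$text = "{% a %}\n    {% b %}\n    {% end %}\n{% end %}\n{% c %}\n{% end %}\n";
foreach (top_level_blocks($text) as $block) {
    printf("%s: [%s]\n", $block['name'], trim($block['inner']));
}
```

Unlike a single recursive pattern, this version can report unbalanced input explicitly, at the cost of a few more lines.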

ridgerunner · Accepted Answer · 2011-04-07T19:18:37.437

Given this test data:

$text = '
{% a %}
    {% b %}
        {% a %}
        {% end %}
    {% end %}
        {% b %}
        {% end %}
{% end %}
{% c %}
{% end %}
';

This tested script does the trick:

<?php
$re = '/
    # Match nested {% a %}{% b %}...{% end %}{% end %} structures.
    \{%[ ]\w[ ]%\}       # Opening delimiter.
    (?:                  # Group for contents alternatives.
      (?R)               # Either a nested recursive component,
    |                    # or non-recursive component stuff.
      [^{]*+             # {normal*} Zero or more non-{
      (?:                # Begin: "unrolling-the-loop"
        \{               # {special} Allow a { as long
        (?!              # as it is not the start of
          %[ ]\w[ ]%\}   # a new nested component, or
        | %[ ]end[ ]%\}  # the end of this component.
        )                # Ok to match { followed by
        [^{]*+           # more {normal*}. (See: MRE3!)
      )*+                # End {(special normal*)*} construct.
    )*+                  # Zero or more contents alternatives
    \{%[ ]end[ ]%\}      # Closing delimiter.
    /ix';
$count = preg_match_all($re, $text, $m);
if ($count) {
    printf("%d Matches:\n", $count);
    for ($i = 0; $i < $count; ++$i) {
        printf("\nMatch %d:\n%s\n", $i + 1, $m[0][$i]);
    }
}
?>

Here is the output:

2 Matches:

Match 1:
{% a %}
    {% b %}
        {% a %}
        {% end %}
    {% end %}
        {% b %}
        {% end %}
{% end %}

Match 2:
{% c %}
{% end %}

Edit: To match an opening tag whose name has more than one word character, replace both occurrences of the \w token with (?!end)\w++ (as is correctly implemented in tchrist's excellent answer).
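
For reference, here is a sketch of the pattern with that substitution applied in both places. The tag names in the sample text (if, for, block) are invented for illustration. One caveat to be aware of: (?!end) also rejects any name that merely starts with "end".

```php
<?php
$re = '/
    \{%[ ](?!end)\w++[ ]%\}        # Opening delimiter: any name except "end".
    (?:                            # Group for contents alternatives.
      (?R)                         # Either a nested recursive component,
    |                              # or non-recursive component stuff.
      [^{]*+                       # Zero or more non-{
      (?:
        \{                         # Allow a { as long as it does not start
        (?!                        # an opening tag or a closing tag.
          %[ ](?!end)\w++[ ]%\}
        | %[ ]end[ ]%\}
        )
        [^{]*+
      )*+
    )*+                            # Zero or more contents alternatives.
    \{%[ ]end[ ]%\}                # Closing delimiter.
    /ix';

$text = '
{% if %}
    {% for %}
    {% end %}
{% end %}
{% block %}
{% end %}
';

$count = preg_match_all($re, $text, $m);
printf("%d Matches:\n", $count);
foreach ($m[0] as $i => $match) {
    printf("\nMatch %d:\n%s\n", $i + 1, $match);
}
```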

score 2 · Answer 2 · answered Apr 07 '11 at 17:17

Here is a demo in Perl of an approach that works for your dataset. The same should work in PHP.

#!/usr/bin/env perl

use strict;
use warnings;

my $string = <<'EO_STRING';
    {% a %}
            {% b %}
            {% end %}
        {% end %}
    {% c %}
    {% end %}
EO_STRING


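print "MATCH: $&\n" while $string =~ m{
    \{ % \s+ (?!end) \w+ \s+ % \}          # opening tag: any name but "end"
    (?: (?: (?! % \} ) . ) | (?R) )*       # any char not starting a tag terminator, or a nested block
    \{ % \s+ end \s+ % \}                  # closing tag
}xsg;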

When run, that produces this:

MATCH: {% a %}
            {% b %}
            {% end %}
        {% end %}
MATCH: {% c %}
    {% end %}

There are several other ways to write that. You may have other constraints that you haven’t shown, but this should get you started.

score 0 · Answer 3 · answered Apr 07 '11 at 15:53

What you're looking for is called a recursive regex. PHP supports it through (?R).

I'm not familiar enough with it to be able to help you with the pattern itself, but hopefully this is a push in the right direction.
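
As a minimal illustration of what (?R) does, here is the classic balanced-parentheses pattern. It is a sketch on a made-up input string, not a pattern for the tags in the question.

```php
<?php
// (?R) re-enters the whole pattern, so every "(" must be closed by its own ")".
$re = '/\((?:[^()]++|(?R))*\)/';
preg_match_all($re, 'f(a(b)c) + g(d)', $m);
print_r($m[0]);  // the two outermost groups: "(a(b)c)" and "(d)"
```

The same idea carries over to the question's tags by putting the opening and closing delimiters where the parentheses are.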

Mr. Llama
