For example, I have a string like this:

{% a %}
    {% b %}
    {% end %}
{% end %}

I want to get the content between {% a %} and its matching {% end %}, which is {% b %} {% end %}.
I have been using {% \S+ %}(.*){% end %} for this, but it breaks when I add a second block, c:

{% a %}
    {% b %}
    {% end %}
{% end %}
{% c %}
{% end %}

It no longer works. How can I do this with a regular expression?

wong2
  • Is it a nested structure of arbitrary depth? If so, that is not a regular language. – eldarerathis Apr 07 '11 at 15:46
  • Please don't try. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454 – casablanca Apr 07 '11 at 15:46
  • You will probably have a much easier time matching the individual elements with a regular expression and using a stack to match the opening / closing blocks. – GWW Apr 07 '11 at 15:47
  • @eldarethis: That is red herring, please stop repeating it. **IT DOES NOT APPLY** because it is absolutely trivial to match nested structures using modern patterns. – tchrist Apr 07 '11 at 16:55
  • @casablanca: Please stop posting that idiotic and irrelevant link. It does not apply, and is wrong anyway. – tchrist Apr 07 '11 at 17:18
  • @eldarerathis: Good thing that PHP regular expressions are not [REGULAR](http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html#comment_40 "Modern regexes are NOT REGULAR, and haven't been for a long, long time!")! – ridgerunner Apr 07 '11 at 18:54
  • @tchrist: That's why it was a comment and not an answer. I personally think that writing a parser is simpler and more understandable than a regex in such situations. – casablanca Apr 07 '11 at 20:37
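
GWW's suggestion above (match the individual tags with a regular expression and pair them up with a stack) can be sketched roughly as follows. This is a minimal illustration in Python, not code from the thread: the function name top_level_blocks, the TAG pattern, and the error handling are all assumptions made for the example, and it treats every {% word %} other than {% end %} as an opening tag.

```python
import re

# Matches any tag of the form {% name %}, including the closing {% end %}.
TAG = re.compile(r"\{%\s+(\w+)\s+%\}")


def top_level_blocks(text):
    """Return (tag_name, inner_content) for each outermost block in text."""
    blocks = []
    stack = []  # entries are (tag name, offset just past the opening tag)
    for m in TAG.finditer(text):
        name = m.group(1)
        if name != "end":
            stack.append((name, m.end()))
        elif not stack:
            raise ValueError(f"unmatched end tag at offset {m.start()}")
        else:
            open_name, inner_start = stack.pop()
            if not stack:  # an outermost block has just been closed
                blocks.append((open_name, text[inner_start:m.start()]))
    if stack:
        raise ValueError(f"unclosed tag {stack[-1][0]!r}")
    return blocks


sample = """{% a %}
    {% b %}
    {% end %}
{% end %}
{% c %}
{% end %}
"""

for name, inner in top_level_blocks(sample):
    print(name, repr(inner))
```

Because the nesting depth is tracked by the stack rather than by the pattern, this approach should work with any regex engine, including ones that have no recursion support.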

3 Answers

Given this test data:

$text = '
{% a %}
    {% b %}
        {% a %}
        {% end %}
    {% end %}
        {% b %}
        {% end %}
{% end %}
{% c %}
{% end %}
';

This tested script does the trick:

<?php
$re = '/
    # Match nested {% a %}{% b %}...{% end %}{% end %} structures.
    \{%[ ]\w[ ]%\}       # Opening delimiter.
    (?:                  # Group for contents alternatives.
      (?R)               # Either a nested recursive component,
    |                    # or non-recursive component stuff.
      [^{]*+             # {normal*} Zero or more non-{
      (?:                # Begin: "unrolling-the-loop"
        \{               # {special} Allow a { as long
        (?!              # as it is not the start of
          %[ ]\w[ ]%\}   # a new nested component, or
        | %[ ]end[ ]%\}  # the end of this component.
        )                # Ok to match { followed by
        [^{]*+           # more {normal*}. (See: MRE3!)
      )*+                # End {(special normal*)*} construct.
    )*+                  # Zero or more contents alternatives
    \{%[ ]end[ ]%\}      # Closing delimiter.
    /ix';
$count = preg_match_all($re, $text, $m);
if ($count) {
    printf("%d Matches:\n", $count);
    for ($i = 0; $i < $count; ++$i) {
        printf("\nMatch %d:\n%s\n", $i + 1, $m[0][$i]);
    }
}
?>

Here is the output:

2 Matches:

Match 1:
{% a %}
    {% b %}
        {% a %}
        {% end %}
    {% end %}
        {% b %}
        {% end %}
{% end %}

Match 2:
{% c %}
{% end %}

Edit: To match opening tags whose names have more than one word character, replace both occurrences of the \w token with (?!end)\w++ (as is correctly implemented in tchrist's excellent answer).
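
To see what that lookahead does in isolation, here is a small Python sketch. It is only an illustration under stated assumptions: Python's built-in re module has no (?R) recursion, so the full pattern cannot be reproduced with it, and the names OPEN_TAG and is_opening_tag are made up for the example. It uses a plain \w+ instead of the possessive \w++ because older Python versions do not accept possessive quantifiers.

```python
import re

# Opening tag: a word-character name that does not begin with "end".
# [ ] is a one-space character class, as in the PHP pattern above.
OPEN_TAG = re.compile(r"\{%[ ](?!end)\w+[ ]%\}", re.IGNORECASE)


def is_opening_tag(tag):
    """True if the whole string is an opening tag such as {% block %}."""
    return OPEN_TAG.fullmatch(tag) is not None


for tag in ["{% a %}", "{% block %}", "{% end %}"]:
    print(tag, is_opening_tag(tag))
```

Note that (?!end) rejects any name that merely starts with "end", so a tag such as {% endless %} would not count as an opening tag either; writing (?!end\b) instead would restrict the exclusion to the exact word.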

ridgerunner

Here is a demo in Perl of an approach that works for your dataset. The same should work in PHP.

#!/usr/bin/env perl

use strict;
use warnings;

my $string = <<'EO_STRING';
    {% a %}
            {% b %}
            {% end %}
        {% end %}
    {% c %}
    {% end %}
EO_STRING


print "MATCH: $&\n" while $string =~ m{
    \{ % \s+ (?!end) \w+ \s+ % \}
    (?: (?: (?! \{ % | % \} ) . ) | (?R) )*
    \{ % \s+ end \s+ % \}
}xsg;

When run, that produces this:

MATCH: {% a %}
            {% b %}
            {% end %}
        {% end %}
MATCH: {% c %}
    {% end %}

There are several other ways to write that. You may have other constraints that you haven’t shown, but this should get you started.

tchrist

What you're looking for is called a recursive regex. PHP supports it via (?R).

I'm not familiar enough with it to be able to help you with the pattern itself, but hopefully this is a push in the right direction.

Mr. Llama