
I need to get the nested blocks into a hash of arrays or a hash tree so that I can substitute the blocks with dynamic content. I need to replace the code between

<!--block:XXX-->

and the first closing end block

<!--endblock--> 

with my dynamic content.

I have this code, which finds single-level comment blocks but not nested ones:

#<!--block:listing-->... html code block here ...<!--endblock-->
$blocks{$1} = $2 while $content =~ /<!--block:(.*?)-->((?:(?:(?!<!--(.*?)-->).)|(?R))*?)<!--endblock-->/igs;

Here is the complete nested HTML template that I want to process. I need to find the inner block "block:third" and replace it with my content, then find "block:second" and replace it, then find the outer block "block:first" and replace it. Please note that there can be any number of nested blocks, not just three as in the example below.

use Data::Dumper;

$content=<<HTML;
some html content here

<!--block:first-->
    some html content here

    <!--block:second-->
        some html content here

        <!--block:third-->
            some html content here
        <!--endblock-->

        some html content here
    <!--endblock-->

    some html content here
<!--endblock-->
HTML

$blocks{$1} = $2 while $content =~ /<!--block:(.*?)-->((?:(?:(?!<!--(.*?)-->).)|(?R))*?)<!--endblock-->/igs;
print Dumper(\%blocks);

So I can access and modify the blocks like $blocks{first} = "my content here" and $blocks{second} = "another content here" etc., then replace the blocks.

I created this regex

Andy Lester
daliaessam
  • [You shouldn't use regex to parse arbitrary HTML.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) That said, this would probably be a lot easier if your end tags contained the same number as the starting tags. Why not use `` for example? – CAustin Mar 13 '14 at 17:53
  • There are lots of other templating systems like [Template::Toolkit](http://www.template-toolkit.org/) that are going to more effectively handle your goal. However, if you replace the first level block with new html, isn't it going to copy over whatever you ideally wanted in the second level? A single assignment doesn't really make sense if they really are nested. – Miller Mar 13 '14 at 17:53
  • @CAustin It will be easy to just use for each block instead of named the end block also, but your suggestion is respectful. – daliaessam Mar 13 '14 at 17:56
  • @Miller I will replace the "third" block with contents then replace it in the second then replace the first. The first which already is part of the whole template. I am building my own simple template system. – daliaessam Mar 13 '14 at 17:58

4 Answers


Update:

This is a response to the "combining" into a single regex...

It appears you don't care about reconstructing the order of the HTML.
So, if you just want to isolate the content of each sub-section, the code below is all you need.
However, you will need lists ( [] ) to reconstitute the order of embedded sub-sections.

Having refreshed my memory of this question, note that the regex used below is the one you should be using.

use Data::Dumper;

$/ = undef;
my $content = <DATA>;


my $href = {};

ParseCore( $href, $content );

#print Dumper($href);

print "\nBase======================\n";
print $href->{content};
print "\nFirst======================\n";
print $href->{first}->{content};
print "\nSecond======================\n";
print $href->{first}->{second}->{content};
print "\nThird======================\n";
print $href->{first}->{second}->{third}->{content};
print "\nFourth======================\n";
print $href->{first}->{second}->{third}->{fourth}->{content};
print "\nFifth======================\n";
print $href->{first}->{second}->{third}->{fourth}->{fifth}->{content};

exit;

sub ParseCore
{
    my ($aref, $core) = @_;
    my ($k, $v);
    while ( $core =~ /(?is)(<!--block:(.*?)-->((?:(?:(?!<!--block:(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--block:.*?-->).)+))/g )
    {
       if (defined $2) {
           $k = $2; $v = $3;
           $aref->{$k} = {};
 #         $aref->{$k}->{content} = $v;
 #         $aref->{$k}->{match} = $1;

           my $curraref = $aref->{$k};
           my $ret = ParseCore($aref->{$k}, $v);
           if (defined $ret) {
               $curraref->{'#next'} = $ret;
           }
        }
        else
        {
           $aref->{content} .= $4;
        }
    }
    return $k;
}

#================================================
__DATA__
some html content here top base
<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base

some html content here 6-8 top base
<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base

Output >>

Base======================
some html content here top base

some html content here1-5 bottom base

some html content here 6-8 top base

some html content here 6-8 bottom base
First======================

    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top

    some html content here 1 bottom

Second======================

        some html content here 2 top

        some html content here 2 bottom

Third======================

            some html content here 3 top

            some html content here 3a
            some html content here 3b

Fourth======================

                some html content here 4 top


Fifth======================

                    some html content here 5a
                    some html content here 5b

You can use regex recursion to match the outer nestings, then parse the inner cores
using a simple recursive function call.

It's then also possible to parse content at the nesting level that you are on.
It's also possible to create a nested structure along the way, which lets you
do the template substitutions later.

You can then reconstruct the HTML.
The only tricky part is traversing the array. But if you know how to traverse
nested data (scalars, array/hash refs, and such), it should be no problem.

Here is the sample.

    # (?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+)

    (?is)                         # Modifiers: Case insensitive, Dot-all
    <!--block:                    # Begin BLOCK
    ( .*? )                       # (1), block name
    -->

    (                             # (2 start), Begin Core
         (?:
              (?:
                   (?!
                        <!--
                        (?: .*? )
                        -->
                   )
                   . 
              )
           |  (?R) 
         )*?
    )                             # (2 end), End Core

    <!--endblock-->               # End BLOCK
 |  
    (                             # (3 start), Or grab content within this core
         (?:
              (?! <!-- .*? --> )
              . 
         )+
    )                             # (3 end)

Perl test case

use Data::Dumper;

$/ = undef;
my $content = <DATA>;


my %blocks = ();
$blocks{'base'} = [];


ParseCore( $blocks{'base'}, $content );


sub ParseCore
{
    my ($aref, $core) = @_;
    while ( $core =~ /(?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+)/g )
    {
        if ( defined $1 )
        {
           my $branch = {};
           push @{$aref}, $branch;
           $branch->{$1} = [];
           ParseCore( $branch->{$1}, $2 );
        }
        elsif ( defined $3 )
        {
           push @{$aref}, $3;
        }
    }

}

print Dumper(\%blocks);

__DATA__

some html content here top base
<!--block:first-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here bottom base

Output >>

$VAR1 = {
          'base' => [
                      '
some html content here top base
',
                      {
                        'first' => [
                                     '
    some html content here 1 top
    ',
                                     {
                                       'second' => [
                                                     '
        some html content here 2 top
        ',
                                                     {
                                                       'third' => [
                                                                    '
            some html content here 3a
            some html content here 3b
        '
                                                                  ]
                                                     },
                                                     '
        some html content here 2 bottom
    '
                                                   ]
                                     },
                                     '
    some html content here 1 bottom
'
                                   ]
                      },
                      '
some html content here bottom base
'
                    ]
        };
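
As a rough illustration of the "traverse the array, then reconstruct the html" step, here is a minimal sketch. The helper name `render_tree` and the small hand-built `$tree` are assumptions made for this example, not part of the code above; the tree only mimics the shape of the dump (plain strings mixed with one-key hashes of block name => array of children).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer's code): walk the nested
# array structure and rebuild the text, swapping in replacement content
# for any block whose name appears in %$replace.
sub render_tree {
    my ($items, $replace) = @_;
    my $out = '';
    for my $item (@$items) {
        if (ref $item eq 'HASH') {
            # Each branch is a one-key hash: block name => array of children
            my ($name) = keys %$item;
            $out .= exists $replace->{$name}
                  ? $replace->{$name}
                  : render_tree($item->{$name}, $replace);
        }
        else {
            $out .= $item;    # plain text between blocks
        }
    }
    return $out;
}

# A small hand-built tree in the same shape as the Dumper output
my $tree = [
    "top\n",
    { first => [
        "1 top\n",
        { second => [ "2 body\n" ] },
        "1 bottom\n",
    ] },
    "bottom\n",
];

my $html = render_tree($tree, { second => "NEW SECOND\n" });
print $html;
```

Because the arrays keep text and sub-blocks in document order, the walk needs no extra bookkeeping; a block without a replacement is simply rendered from its own children.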
  • This works but fails if any block contains comment tags like ` some html content here 1 top . `. Any fix? – daliaessam Jun 03 '14 at 21:01
  • @daliaessam - I don't come here much these days, so I won't be seeing messages in short intervals. After 1-1/2 years I can't remember what this ?'s solution was about. It was a general answer, requiring you to fill in the blanks. If your delimiter is `((?:(?:(?!).)|(?R))*?)|((?:(?!).)+)`. Otherwise, it's been too long ago. –  Jun 05 '14 at 21:22
  • I used your answer and regex to complete the solution. If you can, look at my solution and see whether it can be optimized. Your regex only finds the first nested blocks and stops; is there a way to modify it to find the next root-level nested blocks? As you'll see in my solution, I find the outer blocks first, then pass them to the nested block parser, which is your solution. – daliaessam Jun 05 '14 at 22:43
  • +1000_000 for the new and complete update with one regex for the entire parsing process. – daliaessam Jun 08 '14 at 12:36

Based on @sln's answer above, and despite the advice to use Perl templating or parser modules, I maintain that none of these modules can handle this issue directly.

Here is the solution I came up with.

First I find the outer blocks in the entire content or template with simple regex:

/(<!--block:.*?-->(?:(?:(?!<!--(?:.*?)-->).)|(?R))*?<!--endblock-->)/gis

Then I parse each outer block to find its nested sub blocks based on @sln answer above.

/(<!--block:(.*?)-->((?:(?:(?!<!--block:(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+))/igsx

Then everything is working well. I tested with two outer blocks and each one has nested blocks.

I can reach any sub-block simply like this:

print $blocks->{first}->{content};

print $blocks->{first}->{match};

print $blocks->{first}->{second}->{third}->{fourth}->{content};

Each block hash ref has the keys:

`content`: the block content without the block name and endblock tags.
`match`: the block content with the block name and endblock tags, good for replacing.
`#next`: has the sub block name if exists, good to check if block has children and access them.
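
To show why `match` is "good for replacing", here is a minimal sketch. The `$template` string and the hand-built `$blocks` hash are hypothetical stand-ins for the real parser output; only the key names `content` and `match` come from the description above.

```perl
use strict;
use warnings;

# Hypothetical miniature of the hash described above, built by hand
my $template = "before <!--block:item-->old text<!--endblock--> after";
my $blocks = {
    item => {
        content => 'old text',
        match   => '<!--block:item-->old text<!--endblock-->',
    },
};

# Copy the stored text into a scalar, then quote it with \Q...\E so that
# regex metacharacters in the HTML are matched literally.
my $match  = $blocks->{item}{match};
my $result = $template;
$result =~ s/\Q$match\E/new text/;

print "$result\n";
```

Note that this replaces only the first occurrence of the stored text; if the same block text appears twice in a template, a `/g` flag or position-based replacement would be needed.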

Below is the final code, tested and working.

use Data::Dumper;

$/ = undef;
my $content = <DATA>;

my $blocks = parse_blocks($content);

print Dumper($blocks);

#print join "\n", keys( %{$blocks} ); # root block names
#print join "\n", keys( %{$blocks->{first}} ); # keys of the first block
#print join "\n", keys( %{$blocks->{first}->{second}});

#print Dumper $blocks->{first};
#print Dumper $blocks->{first}->{content};
#print Dumper $blocks->{first}->{match};

# check if fourth block has sub block.
#print exists $blocks->{first}->{second}->{third}->{fourth}->{'#next'}, "\n";

# check if block has sub block, get it:
#if (exists $blocks->{first}->{second}->{third}->{fourth}->{'#next'}) {
#   print $blocks->{first}->{second}->{third}->{fourth}->{ $blocks->{first}->{second}->{third}->{fourth}->{'#next'} }->{content}, "\n";
#}

exit;
#================================================
sub parse_blocks {
    my ($content) = @_;
    my $href = {};
    # find outer blocks only
    while ($content =~ /(<!--block:.*?-->(?:(?:(?!<!--(?:.*?)-->).)|(?R))*?<!--endblock-->)/gis) {
        # parse each outer block nested blocks
        parse_nest_blocks($href, $1);
    }
    return $href;
}
#================================================
sub parse_nest_blocks {
    my ($aref, $core) = @_;
    my ($k, $v);
    while ( $core =~ /(<!--block:(.*?)-->((?:(?:(?!<!--block:(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+))/igsx )
    {
        if (defined $2) {
           $k = $2; $v = $3;
           $aref->{$k} = {};
           $aref->{$k}->{content} = $v;
           $aref->{$k}->{match} = $1;
           #print "1:{{$k}}\n2:[[$v]]\n";
           my $curraref = $aref->{$k};
           my $ret = parse_nest_blocks($aref->{$k}, $v);
           if ($ret) {
               $curraref->{'#next'} = $ret;
           }
           return $k;
        }
    }

}
#================================================
__DATA__
some html content here top base
<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base

some html content here 6-8 top base
<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base

and the output of the entire hash dump is:

$VAR1 = {
          'first' => {
                       'second' => {
                                     'third' => {
                                                  'match' => '<!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->',
                                                  'content' => '
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        ',
                                                  'fourth' => {
                                                                'fifth' => {
                                                                             'match' => '<!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->',
                                                                             'content' => '
                    some html content here 5a
                    some html content here 5b
                '
                                                                           },
                                                                'match' => '<!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->',
                                                                'content' => '
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            ',
                                                                '#next' => 'fifth'
                                                              },
                                                  '#next' => 'fourth'
                                                },
                                     'match' => '<!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->',
                                     'content' => '
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    ',
                                     '#next' => 'third'
                                   },
                       'match' => '<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->',
                       'content' => '
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
',
                       '#next' => 'second'
                     },
          'six' => {
                     'match' => '<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->',
                     'content' => '
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
',
                     'seven' => {
                                  'match' => '<!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->',
                                  'content' => '
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    ',
                                  'eight' => {
                                               'match' => '<!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->',
                                               'content' => '
            some html content here 8a
            some html content here 8b
        '
                                             },
                                  '#next' => 'eight'
                                },
                     '#next' => 'seven'
                   }
        };
daliaessam
  • Good job daliaessam, glad you got it working. I'm still fuzzy after 1-1/2 years though. Good luck my friend! –  Jun 05 '14 at 23:20
  • @sln Do you think the two regexes can be combined, instead of first finding the outer blocks and then parsing each one? Can we make your regex scan the entire template instead of stopping after the first outer block? Thank you for your time. – daliaessam Jun 06 '14 at 00:03
  • Hi, I posted an update for you, not sure if this is what your question was about, but it should point you in the right direction. Cheers! –  Jun 07 '14 at 20:14
  • Yes the update you posted is exactly what I want. +1000_000_000 votes up. – daliaessam Jun 08 '14 at 12:38

I must repeat for you and anyone else who might find this thread, do not use regular expressions in such a complicated way.

I love regular expressions, but they were not designed for this sort of problem. You're going to be 1,000 times better off using a standard templating system like Template::Toolkit.

The problem with regular expressions in this context is there's a tendency to couple parsing with validation. By doing that, the regex ends up being very fragile and it's common for people to skip validation of their data entirely. For example, when a recursive regex sees ((( )), it will claim there are only 2 levels to those parentheses. In truth, there are 2 and a 1/2, and that 1/2 is an error that won't be reported.
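
A minimal sketch of that effect, assuming Perl 5.10 or later for the `(?1)` recursion syntax. The pattern here is a generic balanced-parentheses regex chosen for illustration, not one taken from this thread:

```perl
use strict;
use warnings;

# Recursive pattern for balanced parentheses: group 1 recurses into itself
my $balanced = qr/(\((?:[^()]++|(?1))*\))/;

my $input = '((( ))';    # three opens, only two closes
my $found;
if ($input =~ $balanced) {
    $found = $1;
    print "matched: '$found'\n";    # prints: matched: '(( ))'
}
```

The match succeeds on the balanced inner part and the stray opening parenthesis is silently skipped, so nothing tells you the input was malformed.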

Now, I already communicated the way to avoid this flaw in regex parsing in my answers to two of your other questions:

Basically, make your parsing regex as simple as possible. This serves multiple purposes. It ensures that your regex will be less fragile, and it also encourages not putting the validation in the parsing phase.

I showed you how to start on this particular Stack Overflow problem in the second solution above. Basically, tokenize your data, and then translate the results into your more complicated data structure. I've had some spare time today, so I have decided to fully demonstrate how that translation can easily be done:

use strict;
use warnings;

use Data::Dump qw(dump dd);

my $content = do {local $/; <DATA>};

# Tokenize Content
my @tokens = split m{<!--(?:block:(.*?)|(endblock))-->}, $content;

# Resulting Data Structure
my @data = (
    shift @tokens, # First element of split is always HTML
);

# Keep track of levels of content
# - This is a throwaway data structure to facilitate the building of nested content
my @levels = ( \@data );

while (@tokens) {
    # Tokens come in groups of 3.  Two capture groups in split delimiter, followed by html.
    my ($block, $endblock, $html) = splice @tokens, 0, 3;

    # Start of Block - Go up to new level
    if (defined $block) {
        #debug# print +('  ' x @levels) ."<$block>\n";
        my $hash = {
            block    => $block,
            content  => [],
        };
        push @{$levels[-1]}, $hash;
        push @levels, $hash->{content};

    # End of Block - Go down level
    } elsif (defined $endblock) {
        die "Error: Unmatched endblock found before " . dump($html) if @levels == 1;
        pop @levels;
        #debug# print +('  ' x @levels) . "</$levels[-1][-1]{block}>\n";
    }

    # Append HTML content
    push @{$levels[-1]}, $html;
}
die "Error: Unmatched start block: $levels[-2][-1]{block}" if @levels > 1;

dd @data;

__DATA__

some html content here top base
<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base

some html content here 6-8 top base
<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base

If you uncomment the debugging statements, you'll observe the following traversal of the tokens to build the structure that you want:

  <first>
    <second>
      <third>
        <fourth>
          <fifth>
          </fifth>
        </fourth>
      </third>
    </second>
  </first>
  <six>
    <seven>
      <eight>
      </eight>
    </seven>
  </six>

And the full resulting data structure is:

(
    "\nsome html content here top base\n",
    {
        block   => "first",
        content => [
            "\n    <table border=\"1\" style=\"color:red;\">\n    <tr class=\"lines\">\n        <td align=\"left\" valign=\"<--valign-->\">\n    <b>bold</b><a href=\"http://www.mewsoft.com\">mewsoft</a>\n    <!--hello--> <--again--><!--world-->\n    some html content here 1 top\n    ",
            {
                block   => "second",
                content => [
                    "\n        some html content here 2 top\n        ",
                    {
                        block   => "third",
                        content => [
                            "\n            some html content here 3 top\n            ",
                            {
                                block   => "fourth",
                                content => [
                                    "\n                some html content here 4 top\n                ",
                                    {
                                        block   => "fifth",
                                        content => [
                                            "\n                    some html content here 5a\n                    some html content here 5b\n                ",
                                        ],
                                    },
                                    "\n            ",
                                ],
                            },
                            "\n            some html content here 3a\n            some html content here 3b\n        ",
                        ],
                    },
                    "\n        some html content here 2 bottom\n    ",
                ],
            },
            "\n    some html content here 1 bottom\n",
        ],
    },
    "\nsome html content here1-5 bottom base\n\nsome html content here 6-8 top base\n",
    {
        block   => "six",
        content => [
            "\n    some html content here 6 top\n    ",
            {
                block   => "seven",
                content => [
                    "\n        some html content here 7 top\n        ",
                    {
                        block   => "eight",
                        content => [
                            "\n            some html content here 8a\n            some html content here 8b\n        ",
                        ],
                    },
                    "\n        some html content here 7 bottom\n    ",
                ],
            },
            "\n    some html content here 6 bottom\n",
        ],
    },
    "\nsome html content here 6-8 bottom base",
);
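
One way to use this structure for the original goal (fill in the innermost block first, then work outward) is sketched below. The `render` function, the `%handlers` table of code refs and the tiny `@sample` list are all hypothetical names introduced for this example; only the `block` and `content` keys come from the structure above.

```perl
use strict;
use warnings;

# Hypothetical renderer: children are rendered first, so an inner block is
# already resolved by the time the handler for its parent block runs.
sub render {
    my ($nodes, $handlers) = @_;
    my $out = '';
    for my $node (@$nodes) {
        if (!ref $node) {
            $out .= $node;    # plain HTML between blocks
            next;
        }
        my $inner   = render($node->{content}, $handlers);
        my $handler = $handlers->{ $node->{block} };
        # With no handler registered, the block's own content is kept
        $out .= $handler ? $handler->($inner) : $inner;
    }
    return $out;
}

# A small hand-built list in the same shape as the dump
my @sample = (
    "head ",
    { block => 'first', content => [
        "a ",
        { block => 'second', content => [ "b " ] },
        "c ",
    ] },
    "tail",
);

my %handlers = (
    second => sub { uc $_[0] },              # transform the inner block
    first  => sub { "<div>$_[0]</div>" },    # wrap the outer block
);

my $page = render(\@sample, \%handlers);
print "$page\n";    # prints: head <div>a B c </div>tail
```

Passing code refs rather than fixed strings is a design choice for this sketch: it lets an outer block's replacement see the already-rendered inner content instead of overwriting it.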

Now, why is this method better?

It's less fragile. You already observed how your previous regex broke when other HTML comments were in the content. The tools used to parse here are extremely simple, so there is much less risk of the regex hiding edge cases.

Additionally, it's extremely easy to add functionality to this code. If you wanted to include parameters in your blocks, you could do it the exact same way as demonstrated in my solution to this problem of yours. The parsing and validation functionality wouldn't even have to be changed.

It reports errors. Remove a character from either 'endblock' or 'block' and see what happens. It will give you an explicit error message:

Error: Unmatched start block: first at h.pl line 43

Your recursive regex would just hide the fact that there was an unmatched block in your content. You might of course notice it in your browser when you run your code, but this way the error is reported immediately and you can track it down.
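To make the error-reporting point concrete, here is a minimal sketch of a split-and-stack pass that detects unbalanced blocks. `check_blocks` is a hypothetical helper written for illustration; it only checks balance and is not the parser from this answer.

```perl
use strict;
use warnings;

# Hypothetical helper: tokenize on block comments, then track open
# blocks on a stack so that any imbalance is reported explicitly.
sub check_blocks {
    my ($content) = @_;
    my @open;    # names of the blocks that are currently open

    # The capturing group makes split return the delimiters as tokens too.
    for my $token ( split /(<!--(?:block:.*?|endblock)-->)/s, $content ) {
        if ( $token =~ /\A<!--block:(.*?)-->\z/s ) {
            push @open, $1;
        }
        elsif ( $token eq '<!--endblock-->' ) {
            die "Error: Unmatched end block\n" unless @open;
            pop @open;
        }
        # Anything else is plain content and needs no bookkeeping here.
    }

    die "Error: Unmatched start block: $open[-1]\n" if @open;
    return 1;
}

print "balanced\n"
    if eval { check_blocks("a<!--block:first-->b<!--endblock-->c") };

# A misspelled end tag is treated as plain content, so 'first' stays open.
eval { check_blocks("a<!--block:first-->b<!--edblock-->c") };
print $@;    # prints "Error: Unmatched start block: first"
```

The misspelled tag never reaches the stack logic, which is why the failure surfaces as an unmatched start block instead of silently producing wrong output.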

Summary:

Finally, I will say again, that the best way to solve this problem is not to try to create your own templating system, but to instead use an already created framework such as Template::Toolkit. You commented before that one of your motivations was that you wanted to use a design editor for your templates and that's why you wanted them to use html comments for the templates. However, there are ways to accommodate that desire too with existing frameworks.

Regardless, I hope that you're able to learn something from this code. Recursive regular expressions are cool tools, and great for validating data. But they should not be used for parsing, and hopefully anyone else who is searching for how to use recursive regular expressions will pause and potentially rethink their approach if they are wanting them for that reason.

Miller
  • @sln An alternative approach for your consideration. – Miller Jun 08 '14 at 00:26
  • It's all very nice to consider, but the premise is incorrect. Regex recursion _is_ the way to go in this case. I posted a general, simple template that shows how this is done. It's robust and expandable. The engine cannot skip a single character; it's forced to match every one. This can lead to some very sophisticated parsing options, including inline multiple cores (languages). The downside is the stack, but modern engines are capable of using huge stack counts and sizes. –  Jun 09 '14 at 23:44
  • @sln Wrong in 2 major ways and at least 2 minor ones. **1) Your code does not fail gracefully.** If you rename `endblock` as `edblock`, your regex will completely hang. Mine will report an error stating there's a missing endblock. **2) My solution is 5 times as fast**. 100,000 iterations of the simple template takes 26 secs for your solution, and takes 5 secs for mine. For the more complicated template, yours takes 105 seconds, mine takes 13 secs which is actually 8 times faster. QED – Miller Jun 10 '14 at 02:05

I'm gonna add an additional answer. It's in line with my previous answer, but slightly more
complete and I don't want to muddy up that answer any more.

This is for @daliaessam and kind of a specific response to @Miller's anecdotes on recursive parsing
using regular expressions.

There are only 3 parts to consider. So, using my previous manifestation, I lay out for you guys a
template on how to do this. It's not as hard as you think.

Cheers!

 # //////////////////////////////////////////////////////
 # // The General Guide to 3-Part Recursive Parsing
 # // ----------------------------------------------
 # // Part 1. CONTENT
 # // Part 2. CORE
 # // Part 3. ERRORS

 (?is)

 (?:
      (                                  # (1), Take off CONTENT
           (?&content) 
      )
   |                                   # OR
      (?>                                # Start-Delimiter (in this case, must be atomic because of .*?)
           <!--block:
           ( .*? )                            # (2), Block name
           -->
      )
      (                                  # (3), Take off The CORE
           (?&core) 
        |  
      )
      <!--endblock-->                    # End-Delimiter

   |                                   # OR
      (                                  # (4), Take off Unbalanced (delimiter) ERRORS
           <!--
           (?: block: .*? | endblock )
           -->
      )
 )

 # ///////////////////////
 # // Subroutines
 # // ---------------

 (?(DEFINE)

      # core
      (?<core>
           (?>
                (?&content) 
             |  
                (?> <!--block: .*? --> )
                # recurse core
                (?:
                     (?&core) 
                  |  
                )
                <!--endblock-->
           )+
      )

      # content 
      (?<content>
           (?>
                (?!
                     <!--
                     (?: block: .*? | endblock )
                     -->
                )
                . 
           )+
      )

 )
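For anyone unfamiliar with `(?(DEFINE)...)` and `(?&name)`, here is a much smaller stand-in for the same mechanism, assuming Perl 5.10 or later. It checks balanced parentheses instead of block comments; the subroutine name `group` is made up for this example.

```perl
use strict;
use warnings;

# (?(DEFINE)...) never matches anything itself; it only declares named
# subpatterns that can be called, recursively if needed, via (?&name).
my $balanced = qr{
    \A (?&group)* \z          # the whole string is a sequence of groups

    (?(DEFINE)
        (?<group>
            \(                    # opening delimiter
            (?:
                [^()]++           # plain content, matched possessively
              | (?&group)         # or a nested group (the recursion)
            )*
            \)                    # closing delimiter
        )
    )
}x;

for my $s ( '(a(b)c)', '(a(b)c' ) {
    printf "%-8s %s\n", $s, ( $s =~ $balanced ? 'balanced' : 'unbalanced' );
}
```

The template above follows the same shape: `content` plays the role of the plain-content branch, and `core` is the part that calls itself between a start delimiter and an end delimiter.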

Perl code:

use strict;
use warnings;

use Data::Dumper;

$/ = undef;
my $content = <DATA>;

# Set the error mode on/off here ..
my $BailOnError = 1;
my $IsError = 0;

my $href = {};

ParseCore( $href, $content );

#print Dumper($href);

print "\n\n";
print "\nBase======================\n";
print $href->{content};
print "\nFirst======================\n";
print $href->{first}->{content};
print "\nSecond======================\n";
print $href->{first}->{second}->{content};
print "\nThird======================\n";
print $href->{first}->{second}->{third}->{content};
print "\nFourth======================\n";
print $href->{first}->{second}->{third}->{fourth}->{content};
print "\nFifth======================\n";
print $href->{first}->{second}->{third}->{fourth}->{fifth}->{content};
print "\nSix======================\n";
print $href->{six}->{content};
print "\nSeven======================\n";
print $href->{six}->{seven}->{content};
print "\nEight======================\n";
print $href->{six}->{seven}->{eight}->{content};

exit;


sub ParseCore
{
    my ($aref, $core) = @_;
    my ($k, $v);
    while ( $core =~ /(?is)(?:((?&content))|(?><!--block:(.*?)-->)((?&core)|)<!--endblock-->|(<!--(?:block:.*?|endblock)-->))(?(DEFINE)(?<core>(?>(?&content)|(?><!--block:.*?-->)(?:(?&core)|)<!--endblock-->)+)(?<content>(?>(?!<!--(?:block:.*?|endblock)-->).)+))/g )
    {
       if (defined $1)
       {
         # CONTENT
           $aref->{content} .= $1;
       }
       elsif (defined $2)
       {
         # CORE
           $k = $2; $v = $3;
           $aref->{$k} = {};
 #         $aref->{$k}->{content} = $v;
 #         $aref->{$k}->{match} = $&;

           my $curraref = $aref->{$k};
           my $ret = ParseCore($aref->{$k}, $v);
           if ( $BailOnError && $IsError ) {
               last;
           }
           if (defined $ret) {
               $curraref->{'#next'} = $ret;
           }
       }
       else
       {
         # ERRORS
           print "Unbalanced '$4' at position = ", $-[0], "\n";
           $IsError = 1;

           # Decide to continue here ..
           # If BailOnError is set, just unwind recursion. 
           # -------------------------------------------------
           if ( $BailOnError ) {
              last;
           }
       }
    }
    return $k;
}

#================================================
__DATA__
some html content here top base
<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base

some html content here 6-8 top base
<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base

Output >>

Base======================
some html content here top base

some html content here1-5 bottom base

some html content here 6-8 top base

some html content here 6-8 bottom base

First======================

    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top

    some html content here 1 bottom

Second======================

        some html content here 2 top

        some html content here 2 bottom

Third======================

            some html content here 3 top

            some html content here 3a
            some html content here 3b

Fourth======================

                some html content here 4 top


Fifth======================

                    some html content here 5a
                    some html content here 5b

Six======================

    some html content here 6 top

    some html content here 6 bottom

Seven======================

        some html content here 7 top

        some html content here 7 bottom

Eight======================

            some html content here 8a
            some html content here 8b
  • Good job on error checking, appears to work. And overall, nice job on the regex. As a fan and expert of regexes as well, I appreciate the skill that it takes to craft one. However, benchmarking 100,000 iterations shows that yours takes 46.68 secs versus mine at 6.29 seconds. Mine is more than 700% faster, and doesn't take expert skills in regex to understand. Therefore I stand by all my assertions. – Miller Jun 10 '14 at 03:26