I must repeat, for you and for anyone else who finds this thread: do not use regular expressions in such a complicated way.
I love regular expressions, but they were not designed for this sort of problem. You're going to be 1,000 times better off using a standard templating system like Template::Toolkit.
The problem with regular expressions in this context is that there's a tendency to couple parsing with validation. That makes the regex very fragile, and it's common for people to skip validation of their data entirely. For example, when a recursive regex sees ((( )), it will claim there are only 2 levels to those parentheses. In truth, there are two and a half, and that extra half is an error that won't be reported.
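To make that concrete, here is a minimal sketch. The pattern is a typical balanced-parentheses recursion that I am assuming for illustration, not the regex from your question, and regex_match and check_parens are hypothetical helper names. The first sub parses and validates in one step, the second tokenizes first and validates separately:

```perl
use strict;
use warnings;

# A typical recursive pattern for balanced parentheses (illustrative only).
my $balanced = qr/(\((?:[^()]++|(?1))*\))/;

# Parsing and validating in one step: the regex quietly skips the stray "(".
sub regex_match {
    my ($text) = @_;
    return $text =~ $balanced ? $1 : undef;
}

# Tokenize first, validate separately: every paren is seen, so errors surface.
sub check_parens {
    my ($text) = @_;
    my $depth = 0;
    for my $token ($text =~ /[()]/g) {
        $depth += $token eq '(' ? 1 : -1;
        return "unmatched ')'" if $depth < 0;
    }
    return $depth ? "$depth unmatched '('" : 'ok';
}

print regex_match('((( ))'), "\n";     # (( ))
print check_parens('((( ))'), "\n";    # 1 unmatched '('
```

The regex happily matches the balanced inner part and says nothing about the leftover opening parenthesis, while the token counter reports it.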
I already described how to avoid this flaw in regex parsing in my answers to two of your other questions.
Basically, make your parsing regex as simple as possible. This serves multiple purposes: it makes your regex less fragile, and it discourages mixing validation into the parsing phase.
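As a small sketch of what "as simple as possible" can look like, the snippet below runs a single delimiter regex through split on a made-up one-line template (the sample string is my own, purely for illustration). Because the delimiter has two capture groups, split returns two extra tokens for every delimiter it finds, with undef for the group that did not participate:

```perl
use strict;
use warnings;

# Hypothetical miniature template, just to show the token layout.
my $content = 'top<!--block:first-->inner<!--endblock-->bottom';

# One simple delimiter regex; each capture group yields one token per delimiter.
my @tokens = split m{<!--(?:block:(.*?)|(endblock))-->}, $content;

# Print undef slots explicitly so the groups of three are visible.
print join(' | ', map { defined $_ ? "'$_'" : 'undef' } @tokens), "\n";
# 'top' | 'first' | undef | 'inner' | undef | 'endblock' | 'bottom'
```

After the leading HTML, every delimiter contributes a (block name, endblock, html) triple, and nothing in the regex tries to decide whether the nesting is valid.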
I showed you how to start on this particular problem in the second of those answers. Basically, tokenize your data, and then translate the results into your more complicated data structure. I had some spare time today, so I decided to fully demonstrate how easily that translation can be done:
use strict;
use warnings;
use Data::Dump qw(dump dd);
my $content = do {local $/; <DATA>};
# Tokenize Content
my @tokens = split m{<!--(?:block:(.*?)|(endblock))-->}, $content;
# Resulting Data Structure
my @data = (
    shift @tokens, # First element of split is always HTML
);
# Keep track of levels of content
# - This is a throwaway data structure to facilitate the building of nested content
my @levels = ( \@data );
while (@tokens) {
    # Tokens come in groups of 3. Two capture groups in split delimiter, followed by html.
    my ($block, $endblock, $html) = splice @tokens, 0, 3;
    # Start of Block - Go up to new level
    if (defined $block) {
        #debug# print +(' ' x @levels) ."<$block>\n";
        my $hash = {
            block => $block,
            content => [],
        };
        push @{$levels[-1]}, $hash;
        push @levels, $hash->{content};
    # End of Block - Go down level
    } elsif (defined $endblock) {
        die "Error: Unmatched endblock found before " . dump($html) if @levels == 1;
        pop @levels;
        #debug# print +(' ' x @levels) . "</$levels[-1][-1]{block}>\n";
    }
    # Append HTML content
    push @{$levels[-1]}, $html;
}
die "Error: Unmatched start block: $levels[-2][-1]{block}" if @levels > 1;
dd @data;
__DATA__
some html content here top base
<!--block:first-->
<table border="1" style="color:red;">
<tr class="lines">
<td align="left" valign="<--valign-->">
<b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
<!--hello--> <--again--><!--world-->
some html content here 1 top
<!--block:second-->
some html content here 2 top
<!--block:third-->
some html content here 3 top
<!--block:fourth-->
some html content here 4 top
<!--block:fifth-->
some html content here 5a
some html content here 5b
<!--endblock-->
<!--endblock-->
some html content here 3a
some html content here 3b
<!--endblock-->
some html content here 2 bottom
<!--endblock-->
some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base
some html content here 6-8 top base
<!--block:six-->
some html content here 6 top
<!--block:seven-->
some html content here 7 top
<!--block:eight-->
some html content here 8a
some html content here 8b
<!--endblock-->
some html content here 7 bottom
<!--endblock-->
some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base
If you uncomment the debugging statements, you'll observe the following traversal of the tokens as it builds the structure that you want:
 <first>
  <second>
   <third>
    <fourth>
     <fifth>
     </fifth>
    </fourth>
   </third>
  </second>
 </first>
 <six>
  <seven>
   <eight>
   </eight>
  </seven>
 </six>
And the fully resulting data structure is:
(
  "\nsome html content here top base\n",
  {
    block => "first",
    content => [
      "\n <table border=\"1\" style=\"color:red;\">\n <tr class=\"lines\">\n <td align=\"left\" valign=\"<--valign-->\">\n <b>bold</b><a href=\"http://www.mewsoft.com\">mewsoft</a>\n <!--hello--> <--again--><!--world-->\n some html content here 1 top\n ",
      {
        block => "second",
        content => [
          "\n some html content here 2 top\n ",
          {
            block => "third",
            content => [
              "\n some html content here 3 top\n ",
              {
                block => "fourth",
                content => [
                  "\n some html content here 4 top\n ",
                  {
                    block => "fifth",
                    content => [
                      "\n some html content here 5a\n some html content here 5b\n ",
                    ],
                  },
                  "\n ",
                ],
              },
              "\n some html content here 3a\n some html content here 3b\n ",
            ],
          },
          "\n some html content here 2 bottom\n ",
        ],
      },
      "\n some html content here 1 bottom\n",
    ],
  },
  "\nsome html content here1-5 bottom base\n\nsome html content here 6-8 top base\n",
  {
    block => "six",
    content => [
      "\n some html content here 6 top\n ",
      {
        block => "seven",
        content => [
          "\n some html content here 7 top\n ",
          {
            block => "eight",
            content => [
              "\n some html content here 8a\n some html content here 8b\n ",
            ],
          },
          "\n some html content here 7 bottom\n ",
        ],
      },
      "\n some html content here 6 bottom\n",
    ],
  },
  "\nsome html content here 6-8 bottom base",
);
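Once the content is in this shape, consuming it is ordinary recursion. As a rough sketch, assuming only the shape shown above (plain strings for HTML, hashes with block and content keys for blocks), the following lists block names by nesting depth. The block_outline name is hypothetical and the sample data is a cut-down stand-in, not the full structure:

```perl
use strict;
use warnings;

# Collect block names, indented by nesting depth, from a list whose items are
# either plain HTML strings or { block => ..., content => [...] } hashes.
sub block_outline {
    my ($items, $depth) = @_;
    $depth //= 0;
    my @lines;
    for my $item (@$items) {
        next unless ref($item) eq 'HASH';    # skip plain HTML strings
        push @lines, ('  ' x $depth) . $item->{block};
        push @lines, block_outline($item->{content}, $depth + 1);
    }
    return @lines;
}

# Small hand-written sample in the same shape as the dump above.
my @data = (
    "top\n",
    {
        block   => 'six',
        content => [
            "6 top\n",
            { block => 'seven', content => ["7 top\n"] },
            "6 bottom\n",
        ],
    },
    "bottom\n",
);

print "$_\n" for block_outline(\@data);
```

The same walk could just as easily render the blocks back to HTML or hand them to whatever fills in the template.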
Now, why is this method better?
It's less fragile. You already observed how your previous regex broke when other HTML comments were in the content. The tools used to parse here are extremely simple, so there is much less risk of the regex hiding edge cases.
Additionally, it's extremely easy to add functionality to this code. If you wanted to include parameters in your blocks, you could do it the exact same way as demonstrated in my solution to this problem of yours. The parsing and validation functionality wouldn't even have to be changed.
It reports errors. Remove a character from either 'endblock' or 'block' in the data and see what happens. It will give you an explicit error message, such as:
Error: Unmatched start block: first at h.pl line 43
Your recursive regex would just hide the fact that there was an unmatched block in your content. You might of course notice it in your browser when you run your code, but this way the error is reported immediately and you can track it down.
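On the earlier point about adding parameters to blocks, here is one way it might look. This is only a sketch under my own assumptions: the space-separated key=value syntax (for example <!--block:first color=red border=1-->) and the parse_block_header name are hypothetical, not something defined earlier in this thread. The tokenizing regex stays exactly as it is, and only the captured block text gets a second, separate pass:

```perl
use strict;
use warnings;

# Split a captured block header such as "first color=red border=1" into a
# name and a hash of parameters. The key=value syntax is an assumption.
sub parse_block_header {
    my ($header) = @_;
    my ($name, @pairs) = split ' ', $header;
    my %params;
    for my $pair (@pairs) {
        my ($key, $value) = split /=/, $pair, 2;
        die "Error: Malformed block parameter '$pair' in block '$name'\n"
            unless defined($value) && length($key);
        $params{$key} = $value;
    }
    return ($name, \%params);
}

my ($name, $params) = parse_block_header('first color=red border=1');
print "$name: ", join(', ', map { $_ . '=' . $params->{$_} } sort keys %$params), "\n";
# first: border=1, color=red
```

In the main loop you would call this on $block before building $hash and store the parameters alongside block and content. A malformed parameter is reported with a die, the same way unmatched blocks are.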
Summary:
Finally, I will say again that the best way to solve this problem is not to create your own templating system, but to use an existing framework such as Template::Toolkit. You commented before that one of your motivations was wanting to use a design editor for your templates, which is why you wanted them to use HTML comments. However, there are ways to accommodate that desire with existing frameworks too.
Regardless, I hope that you're able to learn something from this code. Recursive regular expressions are cool tools, and great for validating data. But they should not be used for parsing, and hopefully anyone else who is searching for how to use recursive regular expressions will pause and rethink their approach if parsing is what they want them for.