I have this HTML block:

some html content here top base
<!--block:first-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here bottom base

and I have this regex to match the nested blocks:

/(?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+)/g

This works fine, but it breaks if there is an ordinary comment inside any block's content.

The following fails because of the <!--comment--> inside the first block. Only that first match is affected; the rest of the matches still work:

<!--block:first-->
    some html content here 1 top
    this <!--comment--> will make it fail here.
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here bottom base

This is a follow-up to this question.

The Perl test code is below:

use strict;
use warnings;
use Data::Dumper;

$/ = undef;
my $content = <DATA>;


my %blocks = ();
$blocks{'base'} = [];


ParseCore( $blocks{'base'}, $content );


sub ParseCore
{
    my ($aref, $core) = @_;
    while ( $core =~ /(?is)<!--block:(.*?)-->((?:(?:(?!<!--(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+)/g )
    {
        if ( defined $1 )
        {
           my $branch = {};
           push @{$aref}, $branch;
           $branch->{$1} = [];
           ParseCore( $branch->{$1}, $2 );
        }
        elsif ( defined $3 )
        {
           push @{$aref}, $3;
        }
    }

}

print Dumper(\%blocks);

__DATA__

some html content here top base
<!--block:first-->
    some html content here 1 top
    this <!--comment--> will make it fail here.
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here bottom base
daliaessam
  • 4
    Mojo::DOM can find comments, and won't force you to waste hours fiddling with fragile regexes. – DavidO Jun 04 '14 at 00:46
  • I have to use regex to fit my exact needs. – daliaessam Jun 04 '14 at 00:47
  • @daliaessam: Okay I'll bite. What special needs are these that a regex can fulfil but a proper HTML parser can't? – Borodin Jun 04 '14 at 00:59
  • @Borodin Among other reasons, Mojo is a complete framework that I am not going to load and use for such a small task; I already use Moose in my project. – daliaessam Jun 04 '14 at 01:05
  • What are these exact needs? – Casimir et Hippolyte Jun 04 '14 at 01:08
  • @Casimir-et-Hippolyte I need this exact style to mark blocks of html code as iterators inside templates. – daliaessam Jun 04 '14 at 01:11
  • 1
    In this case you only need to be more precise (must define) with what kind of block could be a "self closing comment block" and what kind of block has always the behaviour of an html tag (need a closing tag). – Casimir et Hippolyte Jun 04 '14 at 01:22
  • Don't forget that you can easily define cases with the `(?(DEFINE) (? ...) (?...)` – Casimir et Hippolyte Jun 04 '14 at 01:25
  • @Casimir-et-Hippolyte How can this be applied to my issue? – daliaessam Jun 04 '14 at 01:33
  • The idea is to know if a `` is a self closing tag or not. To know that the only way is to choose if this tag is a self closing tag or not. (you are the only person who know that). – Casimir et Hippolyte Jun 04 '14 at 01:39
  • The question is why this regex `` is greedy. Shouldn't it be non-greedy and stop at the first -->? This is the only problem: we need to make the (.*?) stop at the first --> and not at the last one. – daliaessam Jun 04 '14 at 01:44
  • If it is the only problem, why don't you use `((?>[^-]+|-(?!->))*)` instead of `(.*?)`? – Casimir et Hippolyte Jun 04 '14 at 01:48
  • I just tried it and does not work `/((?:(?:(?!).)|(?R))*?)/isg` – daliaessam Jun 04 '14 at 01:58
  • Mojo::DOM is not a complete framework. What do you think the downside of using it is? Do you have some sort of measurable performance problem when using Mojo::DOM? If you don't have an actual problem, I suggest you go with what works. – Andy Lester Jun 04 '14 at 02:12
  • 3
    @daliaessam Your task is so small and nimble that you feel it would be a waste to use a working, tested and ready made module for it, which should introduce zero additional load, but you prefer to have us debug your regex forcing you to spend possibly hours reading answers and comments and with not the slightest guarantee that the (at most tested once) result will actually work? The good thing is, future readers coming across this might not be as opposed to using modules in Perl and will find their answer quickly. – DeVadder Jun 04 '14 at 07:23
  • @DeVadder Mojo::DOM, which everyone is talking about, uses this same kind of regex to parse; just look at the Mojo::DOM::HTML source code. Again, Mojo::DOM loads tens of modules/files, which I am not going to use for such a small task. – daliaessam Jun 04 '14 at 12:05

2 Answers

I know you must be tired of hearing this, but you're doing it wrong.

I love regular expressions, but they were not designed for this sort of problem. You're going to be 1,000 times better off using a standard templating system like Template::Toolkit.

If you're stuck with this approach, then I would suggest that you use simpler tools. Instead of trying to get a regex to enforce all of your rules, use the most basic regex possible. In this case, I suggest that you tokenize your text using split:

use strict;
use warnings;

my $content = do {local $/; <DATA>};

my @tokens = split /(<!--(?:block:.*?|endblock)-->)/, $content;

use Data::Dump;
dd \@tokens;

__DATA__

some html content here top base
<!--block:first-->
    some html content here 1 top
    this <!--comment--> will make it fail here.
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here bottom base

Outputs:

[
  "\nsome html content here top base\n",
  "<!--block:first-->",
  "\n    some html content here 1 top\n    this <!--comment--> will make it fail here.\n    ",
  "<!--block:second-->",
  "\n        some html content here 2 top\n        ",
  "<!--block:third-->",
  "\n            some html content here 3a\n            some html content here 3b\n        ",
  "<!--endblock-->",
  "\n        some html content here 2 bottom\n    ",
  "<!--endblock-->",
  "\n    some html content here 1 bottom\n",
  "<!--endblock-->",
  "\nsome html content here bottom base",
]

As you can see, the array alternates between plain text (ordinary comments included) and the block or endblock tags captured by the split pattern.

Now, I don't know what your final goal is, nor what format you want your data in at the end, so I can't make any suggestions from here. But you could pretty easily recreate your original data structure if that actually served your needs. Even better, you can perform error checking that looks for blocks without a matching open or close tag, which your original regex would hide from you.
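As a rough sketch of that last point, the token list can be folded into a nested tree with a stack, which also makes unbalanced tags easy to report. The helper name `build_tree` and the `name`/`children` layout are assumptions made for illustration; they are not part of the original code, and the sample text is shortened.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not from the original post): fold the
# token list produced by split into a nested tree, dying on unbalanced tags.
sub build_tree {
    my (@tokens) = @_;
    my $root  = { name => 'base', children => [] };
    my @stack = ($root);                          # innermost open block is last

    for my $tok (@tokens) {
        if ($tok =~ /\A<!--block:(.*?)-->\z/s) {
            # Opening tag: attach a new node to the current block and descend.
            my $node = { name => $1, children => [] };
            push @{ $stack[-1]{children} }, $node;
            push @stack, $node;
        }
        elsif ($tok eq '<!--endblock-->') {
            # Closing tag: only valid while some block besides the root is open.
            die "endblock without a matching block\n" if @stack == 1;
            pop @stack;
        }
        elsif (length $tok) {
            # Plain text; ordinary comments stay inside it untouched.
            push @{ $stack[-1]{children} }, $tok;
        }
    }
    die "block '$stack[-1]{name}' is never closed\n" if @stack > 1;
    return $root;
}

my $content = join '',
    "top\n",
    "<!--block:first-->\n",
    "  one <!--comment--> here\n",
    "  <!--block:second-->two<!--endblock-->\n",
    "<!--endblock-->\n",
    "bottom\n";

my @tokens = split /(<!--(?:block:.*?|endblock)-->)/, $content;
my $tree   = build_tree(@tokens);

print $tree->{children}[1]{name}, "\n";                # first
print $tree->{children}[1]{children}[1]{name}, "\n";   # second
```

Because each block keeps its children in an array, repeated sibling blocks and their order are preserved, and the original text can be rebuilt by walking the tree in order.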

Addendum

I have provided an expanded full solution to this approach at "Perl replace nested blocks regular expression".

Miller
  • The main idea is, I want to mark some html code blocks as iterators for template processing and at the same time do not break the visual html editors, so I can use this template code `
    $fname$lname
    ` now the block '`user` is an iterator and can be processed and replaced with a list of users names.
    – daliaessam Jun 04 '14 at 01:39
  • This will also not work if you have repeated blocks of nested blocks; it will be very hard to rebuild the data again. The idea is good, but it is not a solution for this. I think I am going to delete this question soon. – daliaessam Jun 05 '14 at 15:49
  • Of course this will work if you have repeated nested blocks. I have created a [full solution](http://stackoverflow.com/a/24101864/1733163) at the original post you made concerning this subject. Also I'm a little insulted that you'd want to delete this question, but it's not even possible because this answer is upvoted. – Miller Jun 08 '14 at 00:35
  • you are not insulted, since the solution is provided in the second original post, after I posted the solution there and here I found it will be a duplicate or confusing. – daliaessam Jun 08 '14 at 05:38

Despite the advice to use templating or parser modules, I am fairly sure none of those modules can handle this directly.

Here is the solution I came up with.

First I find the outer blocks in the whole content or template with a simple regex:

/(<!--block:.*?-->(?:(?:(?!<!--(?:.*?)-->).)|(?R))*?<!--endblock-->)/gis

Then I parse each outer block to find its nested sub-blocks:

/(<!--block:(.*?)-->((?:(?:(?!<!--block:(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+))/igsx

With that, everything works. I tested with two outer blocks, each of which contains nested blocks.
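For comparison only, here is a sketch of a single-pattern variant. It is not the code used in this answer. It assumes that a block name never contains `-->`, so the name is matched one character at a time and cannot run past the first `-->`, and the body refuses only characters that start a block or endblock tag, so ordinary comments pass through. The variable name `$block_re` and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Assumption: block names never contain "-->".
my $block_re = qr{
    <!--block:((?:(?!-->).)+)-->              # group 1: name, cannot run past the first -->
    (                                         # group 2: body
        (?:
            (?R)                              # a complete nested block
          |
            (?!<!--(?:block:|endblock-->)) .  # or one character that does not start a block tag
        )*
    )
    <!--endblock-->
}xs;

my $html = <<'HTML';
<!--block:first-->
    this <!--comment--> no longer breaks the match.
    <!--block:second-->
        inner
    <!--endblock-->
<!--endblock-->
HTML

if ($html =~ $block_re) {
    my ($name, $body) = ($1, $2);
    print "outer block: $name\n";                       # outer block: first
    print "inner block: $1\n" if $body =~ $block_re;    # inner block: second
}
```

The same pattern can be applied again to each captured body to descend one level at a time, in the same way `parse_nest_blocks` does.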

I can reach any sub-block like this:

print $blocks->{first}->{content};

print $blocks->{first}->{match};

print $blocks->{first}->{second}->{third}->{fourth}->{content};

Each block hash ref has the keys:

`content`: the block content without the block name and endblock tags.
`match`: the block content with the block name and endblock tags, good for replacing.
`#next`: the name of the sub-block, if one exists; useful for checking whether a block has a child and for accessing it.
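As a small illustration of these keys, the sketch below follows the `#next` chain from a top-level block down to the innermost one. The hash is a hand-built, shortened stand-in for what `parse_blocks` returns, and `block_chain` is a hypothetical helper name, not part of the code below.

```perl
use strict;
use warnings;

# Hand-built stand-in for the parsed structure (contents shortened).
my $blocks = {
    first => {
        content => 'c1', match => 'm1', '#next' => 'second',
        second  => {
            content => 'c2', match => 'm2', '#next' => 'third',
            third   => { content => 'c3', match => 'm3' },
        },
    },
};

# Hypothetical helper: follow '#next' and return block names in nesting order.
sub block_chain {
    my ($tree, $name) = @_;
    my @chain;
    my $node = $tree->{$name};
    while ($node) {
        push @chain, $name;
        $name = $node->{'#next'};          # name of the child block, if any
        last unless defined $name;
        $node = $node->{$name};
    }
    return @chain;
}

print join(' > ', block_chain($blocks, 'first')), "\n";   # first > second > third
```

Since `#next` holds a single name, a walk like this follows one child per level.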

Below is the final tested, working Perl code.

use strict;
use warnings;
use Data::Dumper;

$/ = undef;
my $content = <DATA>;

my $blocks = parse_blocks($content);

print Dumper($blocks);

#print join "\n", keys( %{$blocks}); # root block names
#print join "\n", keys( %{$blocks->{first}}); # keys of the block 'first'
#print join "\n", keys( %{$blocks->{first}->{second}});

#print Dumper $blocks->{first};
#print Dumper $blocks->{first}->{content};
#print Dumper $blocks->{first}->{match};

# check if fourth block has sub block.
#print exists $blocks->{first}->{second}->{third}->{fourth}->{'#next'}, "\n";

# check if block has sub block, get it:
#if (exists $blocks->{first}->{second}->{third}->{fourth}->{'#next'}) {
#   print $blocks->{first}->{second}->{third}->{fourth}->{ $blocks->{first}->{second}->{third}->{fourth}->{'#next'} }->{content}, "\n";
#}

exit;
#================================================
sub parse_blocks {
    my ($content) = @_;
    my $href = {};
    # find outer blocks only
    while ($content =~ /(<!--block:.*?-->(?:(?:(?!<!--(?:.*?)-->).)|(?R))*?<!--endblock-->)/gis) {
        # parse each outer block nested blocks
        parse_nest_blocks($href, $1);
    }
    return $href;
}
#================================================
sub parse_nest_blocks {
    my ($aref, $core) = @_;
    my ($k, $v);
    while ( $core =~ /(<!--block:(.*?)-->((?:(?:(?!<!--block:(?:.*?)-->).)|(?R))*?)<!--endblock-->|((?:(?!<!--.*?-->).)+))/igsx )
    {
        if (defined $2) {
           $k = $2; $v = $3;
           $aref->{$k} = {};
           $aref->{$k}->{content} = $v;
           $aref->{$k}->{match} = $1;
           #print "1:{{$k}}\n2:[[$v]]\n";
           my $curraref = $aref->{$k};
           my $ret = parse_nest_blocks($aref->{$k}, $v);
           if ($ret) {
               $curraref->{'#next'} = $ret;
           }
           return $k;
        }
    }

}
#================================================
__DATA__
some html content here top base
<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->
some html content here1-5 bottom base

some html content here 6-8 top base
<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->
some html content here 6-8 bottom base

The dump of the entire hash is:

$VAR1 = {
          'first' => {
                       'second' => {
                                     'third' => {
                                                  'match' => '<!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->',
                                                  'content' => '
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        ',
                                                  'fourth' => {
                                                                'fifth' => {
                                                                             'match' => '<!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->',
                                                                             'content' => '
                    some html content here 5a
                    some html content here 5b
                '
                                                                           },
                                                                'match' => '<!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->',
                                                                'content' => '
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            ',
                                                                '#next' => 'fifth'
                                                              },
                                                  '#next' => 'fourth'
                                                },
                                     'match' => '<!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->',
                                     'content' => '
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    ',
                                     '#next' => 'third'
                                   },
                       'match' => '<!--block:first-->
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
<!--endblock-->',
                       'content' => '
    <table border="1" style="color:red;">
    <tr class="lines">
        <td align="left" valign="<--valign-->">
    <b>bold</b><a href="http://www.mewsoft.com">mewsoft</a>
    <!--hello--> <--again--><!--world-->
    some html content here 1 top
    <!--block:second-->
        some html content here 2 top
        <!--block:third-->
            some html content here 3 top
            <!--block:fourth-->
                some html content here 4 top
                <!--block:fifth-->
                    some html content here 5a
                    some html content here 5b
                <!--endblock-->
            <!--endblock-->
            some html content here 3a
            some html content here 3b
        <!--endblock-->
        some html content here 2 bottom
    <!--endblock-->
    some html content here 1 bottom
',
                       '#next' => 'second'
                     },
          'six' => {
                     'match' => '<!--block:six-->
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
<!--endblock-->',
                     'content' => '
    some html content here 6 top
    <!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->
    some html content here 6 bottom
',
                     'seven' => {
                                  'match' => '<!--block:seven-->
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    <!--endblock-->',
                                  'content' => '
        some html content here 7 top
        <!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->
        some html content here 7 bottom
    ',
                                  'eight' => {
                                               'match' => '<!--block:eight-->
            some html content here 8a
            some html content here 8b
        <!--endblock-->',
                                               'content' => '
            some html content here 8a
            some html content here 8b
        '
                                             },
                                  '#next' => 'eight'
                                },
                     '#next' => 'seven'
                   }
        };
daliaessam