I have the regex below to parse XML tags inside HTML code blocks. I cannot use XML libraries. It works as expected with my tests so far, but I would like some experts to optimize it if needed: I will use it to parse many blocks of code to build a whole template, so it may run about 50 times per template on average, and every clock tick counts.

The regex I use for the XML tags is:

(<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>])*)>([^<]*)(<\!\[CDATA\[(.*?)\]\]>)?(</vars>)?)

Then I parse the attributes with this regex:

([^\s\=\"\']+)\s*=\s*(?:(")(.*?)"|'(.*?)')

Here is the Perl test code:

use strict;
use warnings;
no warnings 'uninitialized';

my $text = <<"END_HTML";
<vars type="var" name="selfopened" content="tag without closing slash" size="30" width="200px" >
<vars type="plug" name="selfclosed" content="self closed tag" size="30" width="200px" />
<vars type="var" name="hasclosing" width="400px" height="300px">content of tag with closing</vars>
<vars id="left-part" width="400px" height="300px"><![CDATA[ 
    cdata start here is may have html tags and 'single' and "double" qoutes
    another cdata line
]]></vars>
<vars name="singlelinecdata" width="400px" height="300px"><![CDATA[cdata start here is may have html tags and 'single' and "double" qoutes]]></vars>
</vars>
END_HTML

while ( $text =~ m{
    (<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>])*)>([^<]*)(<\!\[CDATA\[(.*?)\]\]>)?(</vars>)?)
}sxgi ) {
    my ($match, $attrs, $value, $cdata, $cdata_content, $closing) = ( $1, $2, $3, $4, $5, $6 );
    print "match: $match, attrs: $attrs, value: $value, cdata: $cdata, closing: $closing\n\n";

    # parse attributes to key, value pairs
     while ( $attrs =~ m{
        ([^\s\=\"\']+)\s*=\s*(?:(")(.*?)"|'(.*?)')
    }sxg ) {
        my $key = $1;
        my $val = ( $2 ? $3 : $4 );
        print "attr: $key=$val\n";
    }
    print "\n";
}
Miller
daliaessam
  • I am sorry to have to say this, but your solution doesn't work as well as you think it does. For a start, your subexpression `(?:"[^"]*"|'[^']*'|[^"'<>])*` may as well be just `[^>]*` for all the good it's doing. Regular expressions are just about excusable for parsing *extremely simple* bits of XML, but this code contains several bugs waiting to reveal themselves. Please explain your reasons for avoiding an XML library. – Borodin Jun 01 '14 at 19:24
  • The code works now after Miller removed the extra lines I inserted to make it readable. As for the attrs regex, attrs values may be single quoted, double quoted or not quoted at all and it works. – daliaessam Jun 01 '14 at 20:09
  • The functionality of your code hasn't changed. Miller edited it just to make it readable. I assure you that it doesn't work as you think it does, and it is just a matter of luck that it seems to function with the data you have used. You *will* get errors if you use this code on live data, and you may not notice them because it could silently ignore some of the data. Please explain your reasons for avoiding an XML library. – Borodin Jun 01 '14 at 21:10
  • @Miller: Re *"also added closing tag to data so it will actually work"*, the tag was just `HTML` before and it was there in the code. Please chip in with your thoughts on this guy's question; it's clear he doesn't believe me. – Borodin Jun 01 '14 at 21:13
  • @Borodin: can you demonstrate data his/her code breaks on? That's the easiest way to convince. – ysth Jun 01 '14 at 21:53
  • @daliaessam: first rule of optimization: do not optimize until you know it isn't fast enough. – ysth Jun 01 '14 at 21:54
  • @Borodin just to be clear, I do not need nested XML tags. I only need one level of XML tags, so I do not think it will fail. – daliaessam Jun 02 '14 at 01:07

3 Answers

I would strongly advise you to use a prebuilt framework for this. Not knowing your exact use cases, I can't advise you fully, but perhaps Template::Toolkit would work.

If you insist on trying to solve this using regular expressions, I'd advise you to make the format more restrictive:

  • Don't allow self-closing tags
  • Don't allow nested vars tags (maybe you already don't?)

Also, from a parsing perspective, I'd advise some design changes:

  • When pulling potential matches, be as permissive as possible.
  • Then when parsing the found matches, be as restrictive as possible with good error reporting

Something like the following is a reworking of your script:

use strict;
use warnings;

use Data::Dump;

my $text = do {local $/; <DATA>};

while ( $text =~ m{<vars\b(.*?)\s*>(.*?)</vars>}sxgi ) {
    my ($attr, $content) = ($1, $2);

    # Separate and validate Attributes:
    my %attr;
    while ($attr =~ /\G(?:\s+([^=]+)=(?:"([^"]*)"|'([^']*)'|(\S+))|(.+))/sg) {
        if (defined $5) {
            die "Invalid attribute found: <$5> in $attr";
        }
        $attr{$1} = $2 // $3 // $4;
    }

    # Do any processing of content here including anything for CDATA

    # Done:
    dd {
        attr => \%attr,
        content => $content,
    }
}

__DATA__
<vars type="var" name="selfopened" content="tag without closing slash" size="30" width="200px" ></vars>
<vars type="plug" name="selfclosed" content="self closed tag" size="30" width="200px" ></vars>
<vars type="var" name="hasclosing" width="400px" height="300px">content of tag with closing</vars>
<vars id="left-part" width="400px" height="300px"><![CDATA[ 
    cdata start here is may have html tags and 'single' and "double" qoutes
    another cdata line
]]></vars>
<vars name="singlelinecdata" width="400px" height="300px"><![CDATA[cdata start here is may have html tags and 'single' and "double" qoutes]]></vars>

Outputs:

{
  attr => {
    content => "tag without closing slash",
    name    => "selfopened",
    size    => 30,
    type    => "var",
    width   => "200px",
  },
  content => "",
}
{
  attr => {
    content => "self closed tag",
    name    => "selfclosed",
    size    => 30,
    type    => "plug",
    width   => "200px",
  },
  content => "",
}
{
  attr => { height => "300px", name => "hasclosing", type => "var", width => "400px" },
  content => "content of tag with closing",
}
{
  attr => { height => "300px", id => "left-part", width => "400px" },
  content => "<![CDATA[ \n    cdata start here is may have html tags and 'single' and \"double\" qoutes\n    another cdata line\n]]>",
}
{
  attr => { height => "300px", name => "singlelinecdata", width => "400px" },
  content => "<![CDATA[cdata start here is may have html tags and 'single' and \"double\" qoutes]]>",
}
Miller

2 comments:

From your question it looks like you're building a templating system. Why? There are many already written that work well, some with caching (Template::Toolkit, for example, as already mentioned).

Then if you really want to build your own, why use a semi-XML format?

If you want an XML syntax, then use a real parser. The XML spec is tiny as standards go, but it's still 32 pages long. I am not sure it's worth implementing all of it for your templating system! For example, I don't think your parser deals with more than one CDATA section in the content, which is allowed by XML, and in fact required if you want to include the string ']]>' in your CDATA.

So essentially you are implementing a subset of XML that's not defined, except by the implementation of the parser. I am not sure that's going to be great for users, or for you when you have to debug your regexp or improve it to accept more XML features. For reference, have a look at a shallow XML parser in Perl: http://www.cs.sfu.ca/~cameron/REX.html#AppA. It looks like it takes a bit more than a single regexp, and I don't think it even deals with entities.
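To make the CDATA point concrete, here is a minimal sketch that runs the pattern from the question against an element whose content is split over two CDATA sections. The input string is a made-up example, not data from the question.

```perl
use strict;
use warnings;

# The tag pattern from the question, unchanged.
my $re = qr{
    (<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>])*)>([^<]*)(<\!\[CDATA\[(.*?)\]\]>)?(</vars>)?)
}sxi;

# Hypothetical input: one element containing two adjacent CDATA sections.
my $xml = '<vars name="x"><![CDATA[one]]><![CDATA[two]]></vars>';

# In list context the match returns the six capture groups.
my ($match, $attrs, $value, $cdata, $cdata_content, $closing) = $xml =~ $re;

print "cdata content: $cdata_content\n";
print "closing tag:   ", ( defined $closing ? $closing : '(not matched)' ), "\n";
```

Only the first section is captured, the second one is dropped without any error, and the closing tag is never matched because the optional `(</vars>)?` group simply gives up.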

Alternatively, you could use a format that is not XML, with a simple, unambiguous description that's less than 32 pages. It would be easier to parse and at the very least you would avoid comments telling you to use a proper parser ;--)

mirod

I reduced the XML tags to two kinds: one self-closing, and one with a closing tag.

Self-closing XML tag example:

<vars type="mod" name="weather" city="knox" countery="usa" />

Example of an XML tag with a closing tag:

<vars type="var" name="hasclosing" width="400px" height="300px">content of tag with closing</vars>

Then I used one regex to match each of these two tags:

(<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>])*)/>)

(<vars(\s*[^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>\/])*)>(.*?)<\/vars>)

Below is the Perl test code.

use strict;
use warnings;
no warnings 'uninitialized';

my ($match, $attrs, $value, $cdata, $cdata_content, $closing);

my $text = <<HTML;
{first_name} <vars name="first_name" /> {first_name_notes}
{last_name} <vars name="last_name" /> {last_name_notes}
{email} <vars type="var" name='email' /> {email_notes}
{website} <vars type="var" name="website" /> {website_notes}
<vars type="mod" name="weather" city="knox" countery="usa" />
<vars type="var" name="hasclosing" width="400px" height="300px">content of tag with closing</vars>
<vars type="action" name="Stats::Users->active" />

<vars id="left-part" width="400px" height="300px"><![CDATA[ 
    cdata start here is may have html tags and 'single' and "double" qoutes
    another cdata line
]]></vars>
<vars name="singlelinecdata" width="400px" height="300px" content="ahmed<b>class/subclass"><![CDATA[cdata start here is may have html tags and 'single' and "double" qoutes]]></vars>

HTML

    while ( $text =~ m{
        (<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>])*)/>)|(<vars(\s*[^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>\/])*)>(.*?)<\/vars>)
    }sxgi ) {

        if ($1) {
            ($match, $attrs, $value) = ($1, $2, undef);
        }
        else {
            ($match, $attrs, $value) = ( $3, $4, $5);
        }
        print "match:\n$match \nattrs: $attrs\nvalue: $value\n";

        # parse attributes to key and value pairs
         while ( $attrs =~ m{([^\s\=\"\']+)\s*=\s*(?:(")(.*?)"|'(.*?)')}sxg ) {
                my $key = $1;
                my $val = ( $2 ? $3 : $4 );
                print "attr: $key=$val\n";
        }

        print "\n";
    }

I still need some experts to optimize these regexes if possible.
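Following the advice in the comments to measure before optimizing, one way to get a baseline is the core Benchmark module. This is only a sketch under assumptions: the 50-tag sample text and the pass count of 1000 are made up for illustration, and only the self-closing pattern is timed.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# Hypothetical sample: 50 self-closing tags, roughly one template's worth.
my $text = join '', map { sprintf( qq{<vars type="var" name="field%d" />\n}, $_ ) } 1 .. 50;

# The self-closing pattern from above, precompiled once with qr//.
my $self_closing = qr{<vars\s*([^\!\?\s<>](?:"[^"]*"|'[^']*'|[^"'<>])*)/>}si;

my $count = 0;
my $t = timeit( 1000, sub {
    $count = 0;
    # A failed //g match resets pos(), so every pass scans from the start.
    $count++ while $text =~ /$self_closing/g;
} );

print "matches per pass: $count\n";
print "1000 passes: ", timestr($t), "\n";
```

If the measured time per pass is already negligible next to the rest of the template build, the regexes are probably not the place to spend optimization effort.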

daliaessam