How to match an HTML tag with a Perl regex?

Question

Given the code below, I want to match the first form occurrence. I found out that negative lookahead ?! may be used to achieve that but it doesn't work. What's wrong with my regex?

#test
$test = "<form abc> foo </form> <form gg> bar </form>";
$test =~ m/<form[^>]*abc[^>]*>(?!.*form>.*)form>/s;
print $&;

Please answer how to do that, not not to do that. I just wanna learn regex. — Marcin Król, Aug 19 '12 at 21:57
The first step of learning regexps is accepting that there are things they are _not_ good for on their own. Parsing complex languages with nesting structures such as HTML is one of those things. — hmakholm left over Monica, Aug 19 '12 at 21:59
just tell me how to make this example work.... i know there must obviously be some libraries to do this work but i wanna do this with regex — Marcin Król, Aug 19 '12 at 22:01
is it even possible with regex? it's not about html but other similar cases it could be useful. — Marcin Król, Aug 19 '12 at 22:09
@HenningMakholm Not true. *Regular Expressions* in the CS meaning are near useless, of course, but *Perl Regexes* are a different story … they are fully fledged top-down recursive parsers, and can include arbitrary code inside the regex itself (with modern perls). I can produce patterns from inside the pattern. I can parse HTML. QED. (Not that I should or would) — amon, Aug 19 '12 at 22:45
@amon: But just because you can doesn't mean it's a good way to do it. — hmakholm left over Monica, Aug 19 '12 at 23:40
The OP needs to find a substring matching certain pattern from a larger string. How is regular expression not the right tool? — cleong, Aug 20 '12 at 01:37

amon · Accepted Answer · 2012-08-20T00:36:11.457

First, before explaining the regex: Use a module like HTML::TreeBuilder to create a document tree, then fetch your information from there. Parsing HTML with regexes is too error-prone to use in the real world.

The Problem with your regex

Here is your string:

"<form abc> foo </form> <form gg> bar </form>"

And your regex (written expanded for readability, as with the /x flag):

<form [^>]* abc [^>]* > (?! .* form> .* ) form>

<form anchors the match where this literal character sequence is found.
[^>]* matches any number of non-> characters. Initially it matches the space and abc.
abc matches the literal character sequence abc. But because the regex engine is looking at a > at this point, it has to backtrack until the preceding [^>]* matches only the single space.
[^>]* matches nothing, as the engine sees a >.
> matches the >.
The negative lookahead succeeds only when the expression .* form> .* would not match at this position.
- The first .* would consume all characters until the end of the string.
- form> causes the engine to backtrack until the .* matches foo </form> <form gg> bar </.
- The second .* matches nothing, but that is okay.

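So the pattern inside the lookahead matches, but because it is a negative lookahead, the assertion fails. The last part of the regex is never even tried. Note that a lookahead consumes nothing, so even a successful assertion would leave the final form> to match directly after the >, where the string continues with " foo".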
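The walkthrough can be checked directly. This is a minimal sketch: the variable names ($rest, $lookahead_body_matches and so on) are made up for illustration, and the \A anchor is added only to pin the lookahead body to the position right after the opening tag.

```perl
use strict;
use warnings;

my $test = "<form abc> foo </form> <form gg> bar </form>";

# The pattern from the question, unchanged.
my $original_matches = $test =~ m/<form[^>]*abc[^>]*>(?!.*form>.*)form>/s ? 1 : 0;

# The text the lookahead sees right after the ">" of the first opening tag.
my $rest = substr($test, length "<form abc>");

# The body of the lookahead, anchored at that position.
my $lookahead_body_matches = $rest =~ m/\A.*form>.*/s ? 1 : 0;

print "original pattern matches: $original_matches\n";        # 0
print "lookahead body matches:   $lookahead_body_matches\n";  # 1
```

Because the body matches, the negative assertion fails, and the whole pattern fails with it.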

Strategies

The .* consumes too many characters in our case. This is called greedy matching.

Non-greedy matching is written with a trailing ?, as in .*?. This version consumes zero characters initially and first checks the next part of the pattern. If that doesn't work, it consumes one more character and tries again, until there is a match.
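To see the difference on the string from the question, here is a small sketch that captures only the text between the tags. The capture groups and the variable names $greedy and $lazy are additions for illustration.

```perl
use strict;
use warnings;

my $test = "<form abc> foo </form> <form gg> bar </form>";

# Greedy: .* runs to the end of the string and backs up to the LAST </form>.
my ($greedy) = $test =~ m/<form[^>]*>(.*)<\/form>/s;

# Non-greedy: .*? grows one character at a time and stops at the FIRST </form>.
my ($lazy) = $test =~ m/<form[^>]*>(.*?)<\/form>/s;

print "greedy: [$greedy]\n";   # [ foo </form> <form gg> bar ]
print "lazy:   [$lazy]\n";     # [ foo ]
```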

A better Regex

<form [^>]* > .*? </form>

Inside the opening tag, only non-> characters are allowed. Between the tags, any character is allowed. We do non-greedy matching, so the first end tag matches and ends the regex.
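As a sketch of how this could be applied to the question, the pattern below adds a \b abc \b part to require the abc attribute. That part, the $swapped test string and the variable names are assumptions for illustration, not something the pattern above needs.

```perl
use strict;
use warnings;

# /x allows the pattern to be written with spaces, /s lets . match newlines too.
# The "\b abc \b" part is an addition standing in for the abc attribute.
my $form_abc = qr{ ( <form [^>]* \b abc \b [^>]* > .*? </form> ) }xs;

my $test    = "<form abc> foo </form> <form gg> bar </form>";
my $swapped = "<form gg> bar </form> <form abc> foo </form>";

# In list context the match returns the captured text.
my ($first)  = $test    =~ $form_abc;
my ($second) = $swapped =~ $form_abc;

print "$first\n";    # <form abc> foo </form>
print "$second\n";   # <form abc> foo </form>
```

Since [^>]* cannot step over the > of the first opening tag, the abc requirement cannot be satisfied by a later tag, so the order of the forms does not matter.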

However, this solution is a bit problematic. A tolerant HTML parser would not choke on an attribute like attr="val<u>e". We will. Also, the first </form> is matched, which is undesirable in the event that we have nested forms. While unproblematic in this use case, this regex is totally useless when matching <div>s or the like.
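Both problems are easy to reproduce. The sketch below uses made-up input strings, and splits the pattern into two capture groups (an addition for illustration) so that the damage is visible.

```perl
use strict;
use warnings;

# The non-greedy pattern, with separate groups for the attributes and the body.
my $form = qr{ <form ([^>]*) > (.*?) </form> }xs;

# 1. A ">" inside an attribute value ends the opening tag too early.
my ($attrs, $body) = q{<form name="val<u>e"> foo </form>} =~ $form;
print "attrs: [$attrs]\n";   # [ name="val<u]
print "body:  [$body]\n";    # [e"> foo ]

# 2. With nested tags, the match stops at the first closing tag.
my $nested = "<form a> x <form b> y </form> z </form>";
my ($match) = $nested =~ m{ ( <form [^>]* > .*? </form> ) }xs;
print "match: [$match]\n";   # [<form a> x <form b> y </form>]
```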

Regexp Grammars

Perl regexes are incredibly powerful and allow you to declare recursive grammars. The built-in syntax is a bit awkward, but I recommend the Regexp::Grammars module to do that easily. Better yet, simply use a fully-fledged HTML parser that already exists.
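For a taste of the built-in syntax, here is a rough sketch using a named group that calls itself with (?&name), which is available from perl 5.10 on. The input string and the group name div are made up, and the pattern only handles well-formed, simply nested tags. It is meant to illustrate recursion, not to be a usable HTML parser.

```perl
use strict;
use warnings;

my $html = "<div> a <div> b </div> c </div> <div> d </div>";

my $nested_div = qr{
    (?<div>                    # named group, so the pattern can call itself
        <div [^>]* >           # opening tag
        (?:
            (?&div)            # either a complete nested div element
          |
            (?! </?div\b ) .   # or one character that does not start a div tag
        )*
        </div>                 # the closing tag that belongs to the opening one
    )
}xs;

# The recursive pattern finds the balanced outer element.
my ($outer) = $html =~ $nested_div;

# The plain non-greedy pattern stops at the first closing tag.
my ($lazy) = $html =~ m{ ( <div [^>]* > .*? </div> ) }xs;

print "recursive:  $outer\n";   # <div> a <div> b </div> c </div>
print "non-greedy: $lazy\n";    # <div> a <div> b </div>
```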

Fetching the match

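The use of $& (and $` and $') is discouraged: once perl sees one of them anywhere in the program, it has to provide them for every pattern match, which slows all matches down (perl 5.20 and later largely remove this penalty). This won't manifest itself in a small script, but it's bad style anyway. Rather, enclose your whole regex in parens to capture the match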

m{ ( <form [^>]* > .*? </form> ) }xs

and then use $1.
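Put together with the string from the question, this could look like the following sketch. The variable names $first_form and @forms are made up, and the /g variant at the end is an extra illustration, not something the question asked for.

```perl
use strict;
use warnings;

my $test = "<form abc> foo </form> <form gg> bar </form>";

# $1 is only meaningful after a successful match, so test the match first.
my $first_form;
if ($test =~ m{ ( <form [^>]* > .*? </form> ) }xs) {
    $first_form = $1;
    print "$first_form\n";   # <form abc> foo </form>
}

# In list context with /g, the same pattern returns every captured form.
my @forms = $test =~ m{ ( <form [^>]* > .*? </form> ) }xsg;
print scalar(@forms), " forms found\n";   # 2 forms found
```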

The perlretut Tutorial may be a good introduction to understand Perl regexes.