First, before explaining the regex: use a module like HTML::TreeBuilder to create a document tree, then fetch your information from there. Parsing HTML with regexes is too error-prone to use in the real world.
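To make that advice concrete, here is a minimal sketch of the tree-based approach. It assumes HTML::TreeBuilder is installed from CPAN (it is not a core module) and that "has an abc attribute" is the selection criterion you are after; adjust the look_down criteria to whatever you actually need.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module, not part of core Perl

my $html = '<form abc> foo </form> <form gg> bar </form>';

# Build a document tree from the string.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Find every <form> element that carries an "abc" attribute.
# (Assumption: the presence of that attribute is what identifies the form.)
my @forms = $tree->look_down(
    _tag => 'form',
    sub { defined $_[0]->attr('abc') },
);

print $_->as_HTML, "\n" for @forms;

# Free the tree explicitly (needed on older HTML::Tree versions).
$tree->delete;
```

The parser deals with quoting, odd attribute values and nesting, so none of the regex pitfalls discussed below apply.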
The Problem with your regex
Here is your string:
"<form abc> foo </form> <form gg> bar </form>"
And your regex (written expanded for readability, as with the /x flag):
<form [^>]* abc [^>]* > (?! .* form> .* ) form>
<form anchors the match where this literal character sequence is found.
[^>]* matches any number of non-> characters. Initially it is greedy and matches " abc", everything up to the >.
abc must then match the literal character sequence abc. But the regexp engine is now looking at the >, so it has to backtrack until [^>]* matches only the single space before abc.
The second [^>]* matches nothing, as the engine sees a > right away.
> matches the >.
The negative lookahead (?! .* form> .* ) succeeds only when the expression .* form> .* would not match at this position. Inside it:
The first .* consumes all characters until the end of the string.
form> forces the engine to backtrack until that .* matches " foo </form> <form gg> bar </".
The final .* matches nothing, but that is okay.
So the expression inside the lookahead matches, but because it is a negative lookahead, the assertion fails. The last part of the regex is never even tried.
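You can check this walkthrough yourself. The following sketch splits the problem in two: it first matches just the opening tag, then tests the body of the lookahead against whatever follows. The variable names are mine, chosen for illustration only.

```perl
use strict;
use warnings;

my $html = '<form abc> foo </form> <form gg> bar </form>';

# The regex from the question, in expanded /x form.
my $original = qr{ <form [^>]* abc [^>]* > (?! .* form> .* ) form> }x;

# Match only the opening tag, then look at what follows it.
my $rest;
if ($html =~ m{ <form [^>]* abc [^>]* > }x) {
    $rest = substr($html, $+[0]);   # $+[0] is the offset where the match ended
}

# The body of the lookahead matches the remainder of the string ...
my $inner_matches = $rest =~ m{ \A .* form> .* }x ? 1 : 0;

# ... so the negative lookahead fails, and with it the whole regex.
my $original_matches = $html =~ $original ? 1 : 0;

print "inner pattern matches:  $inner_matches\n";     # 1
print "original regex matches: $original_matches\n";  # 0
```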
Strategies
The .* consumes too many characters in our case. This is called greedy matching.
Non-greedy matching is written with a trailing ?, like .*?. This version consumes zero characters initially and first checks whether the next part of the pattern matches. If it doesn't, it consumes one more character and tries again, repeating until there is a match.
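The difference is easiest to see side by side. This small sketch runs a greedy and a non-greedy capture against the string from the question; the two patterns differ only in the ? after .*.

```perl
use strict;
use warnings;

my $html = '<form abc> foo </form> <form gg> bar </form>';

# Greedy: .* grabs as much as possible, so it runs to the LAST </form>.
my ($greedy) = $html =~ m{ <form [^>]* > (.*) </form> }x;

# Non-greedy: .*? grabs as little as possible, so it stops at the FIRST </form>.
my ($lazy) = $html =~ m{ <form [^>]* > (.*?) </form> }x;

print "greedy: [$greedy]\n";   # [ foo </form> <form gg> bar ]
print "lazy:   [$lazy]\n";     # [ foo ]
```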
A better Regex
<form [^>]* > .*? </form>
Inside the opening tag, only non-> characters are allowed. Between the tags, any character is allowed (add the /s flag if the content may span several lines, so that . also matches newlines). We do non-greedy matching, so the first end tag matches and ends the regex.
However, this solution is a bit problematic. A tolerant HTML parser would not choke on an attribute like attr="val<u>e". We will. Also, the match ends at the first </form>, which is wrong if forms are nested. That is not a problem in this use case, but it makes this regex useless for matching <div>s or the like.
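To illustrate the nesting problem, here is the same non-greedy pattern applied to <div>s. The input string is a made-up example, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical nested input, not from the question.
my $html = '<div> outer <div> inner </div> tail </div>';

# Same shape as the form regex: opening tag, lazy content, closing tag.
my ($match) = $html =~ m{ ( <div [^>]* > .*? </div> ) }xs;

print "$match\n";   # <div> outer <div> inner </div>
```

The match stops at the inner closing tag and leaves " tail </div>" behind, so the result is not a balanced element.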
Regexp Grammars
Perl regexes are incredibly powerful and allow you to declare recursive grammars. The built-in syntax is a bit awkward, so I recommend the Regexp::Grammars module to do that easily. Better yet, simply use a full-fledged HTML parser that already exists.
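For a taste of the built-in syntax, here is a rough sketch of a recursive pattern that matches a balanced <div> element. It uses a named group and the (?&name) recursion construct, which need Perl 5.10 or later. It is not Regexp::Grammars, and it still ignores comments, quoted attribute values and everything else a real parser handles; the group name element and the test string are my own.

```perl
use strict;
use warnings;

# Hypothetical nested input, not from the question.
my $html = '<div> outer <div> inner </div> tail </div>';

my $balanced = qr{
    (?<element>
        <div \b [^>]* >                # opening tag
        (?:
            (?&element)                # a nested, balanced element
          |
            (?! </? div \b ) .         # or any character that does not start a div tag
        )*
        </div>                         # the matching closing tag
    )
}xs;

my $whole;
if ($html =~ $balanced) {
    $whole = $+{element};   # value of the outermost "element" group
    print "$whole\n";
}
```

Unlike the non-greedy version, this one returns the complete outer element, because every nested opening tag has to be paired with its own closing tag before the outer </div> can match.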
Fetching the match
The use of $& (and $` and $') is discouraged: on perls older than 5.20, mentioning any of them anywhere in the program slows down every regex match. This won't be noticeable in a small script, but it's bad style anyway. Rather, enclose your whole regex in parens to capture the match:
m{ ( <form [^>]* > .*? </form> ) }
and then use $1.
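Put together, that could look like the following sketch. I have added the /x and /s flags (expanded syntax, and . matching newlines) and a /g variant for collecting all forms; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $html = '<form abc> foo </form> <form gg> bar </form>';

my $first;
if ($html =~ m{ ( <form [^>]* > .*? </form> ) }xs) {
    $first = $1;          # copy $1 right away; the next successful match overwrites it
    print "$first\n";     # <form abc> foo </form>
}

# In list context with /g, the same pattern returns every captured form.
my @all = $html =~ m{ ( <form [^>]* > .*? </form> ) }xsg;
print scalar(@all), " forms found\n";   # 2 forms found
```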
The perlretut tutorial may be a good introduction to Perl regexes.