Try something like:
<p>(?:(?!</?p>).)*</p>(?!(?:(?!</?p>).)*(<p>|$))
A quick breakdown:
<p>(?:(?!</?p>).)*</p>
matches <p> ... </p> that contains neither <p> nor </p>. And the part:
(?!(?:(?!</?p>).)*(<p>|$))
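is "true" when, looking ahead ((?! ... )), no <p> or end of input ((<p>|$)) can be reached without passing a <p> or </p> in between ((?:(?!</?p>).)*). In other words, the next tag after the match must be a closing </p>, which means the match is nested inside another <p>.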
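To make the two parts easier to follow, here is a sketch of the same pattern written with the /x flag so each piece can carry a comment. The variable names ($inner_p, @found) and the short test string are chosen only for this illustration, and the last group is made non-capturing so that a list-context match returns just the matched elements.

```perl
use strict;
use warnings;

# Same pattern as above, spread out with /x (illustrative names only).
my $inner_p = qr{
    <p>                       # an opening tag
    (?: (?! </?p> ) . )*      # any characters, as long as no tag starts here
    </p>                      # the closing tag
    (?!                       # what follows must NOT be:
        (?: (?! </?p> ) . )*  #   tag-free text
        (?: <p> | $ )         #   ending in another opening tag or end of input
    )
}x;

my @found = "<p>foo <p>bar</p> foo</p>" =~ /($inner_p)/g;
print "Found: $_\n" for @found;   # prints "Found: <p>bar</p>"
```

Keep in mind that . does not match a newline by default, so paragraphs spanning several lines would probably need the /s flag as well.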
A demo:
my $txt = "<p>aaa aa a</p> <p>foo <p>bar</p> foo</p> <p> bb <p>x</p> bb</p>";
# $1 holds each <p>...</p> that is nested inside another <p>
while ($txt =~ m/(<p>(?:(?!<\/?p>).)*<\/p>)(?!(?:(?!<\/?p>).)*(<p>|$))/g) {
    print "Found: $1\n";
}
prints:
Found: <p>bar</p>
Found: <p>x</p>
Note that this regex trickery only works for <p>baz</p> in the string:
<p>foo <p>bar</p> <p>baz</p> foo</p>
<p>bar</p> is not matched, because the next tag after it is an opening <p>! After replacing <p>baz</p>, you could do a second run on the input, in which case <p>bar</p> will be matched.
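The repeated runs could be sketched roughly as follows. This assumes the nested elements should simply be removed (replaced with an empty string), which is only a guess at what the replacement should be. The loop repeats the substitution until a pass no longer changes anything.

```perl
use strict;
use warnings;

my $txt = "<p>foo <p>bar</p> <p>baz</p> foo</p>";
my @found;

# Each pass removes the nested <p>...</p> elements the pattern can see.
# Assumption for this sketch: the replacement is the empty string.
1 while $txt =~ s{(<p>(?:(?!</?p>).)*</p>)(?!(?:(?!</?p>).)*(?:<p>|$))}{
    push @found, $1; ""
}ge;

print "Found: $_\n" for @found;   # <p>baz</p> first, then <p>bar</p>
print "Left: $txt\n";
```

The first pass only sees <p>baz</p>. Once that is gone, the second pass picks up <p>bar</p>, and the third pass finds nothing and ends the loop.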