Regex matching with exclusions of findings

Question

How can I find all strings between < and > while excluding certain tags such as b, i, ul, ol, li, p? Is there a shorter solution than the following?

while ($html =~ /<(\w+)>/g) { 
  print "found $1\n" if $1 ne 'b' && $1 ne 'ul' && $1 ne 'p' ...
}

https://stackoverflow.com/a/1732454/725418 – TLP Dec 09 '22 at 15:32

score 4 · Answer 1 · answered Dec 09 '22 at 10:46

You can use a library instead; Mojo::DOM makes this easy:

use feature 'say';   # needed for say
use Mojo::DOM;

my $dom = Mojo::DOM->new($html);

# Print every element that is not one of the excluded tags
for ( $dom->find(':not(b,i,ul,ol,li,p)')->each ) {
    say;
}

You now also have the parsed HTML and can process it further as needed.
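
If only the tag names are wanted, as in the question, something along these lines might work. This is a minimal sketch, not part of the original answer: it assumes Mojolicious is installed, the sample $html string, the %skip hash, and the kept_tags helper are made up for illustration, and it filters with find('*') plus grep rather than the :not(...) selector.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical sample input, not from the original post
my $html = '<div><p>Some <b>bold</b></p><span>text</span></div>';

# Tags to leave out, as listed in the question
my %skip = map { $_ => 1 } qw(b i ul ol li p);

# Hypothetical helper: tag names of all elements whose tag is not in %skip
sub kept_tags {
    my ($markup) = @_;
    return Mojo::DOM->new($markup)
        ->find('*')                          # every element in the document
        ->grep(sub { !$skip{ $_->tag } })    # drop the excluded tags
        ->map('tag')                         # keep only the tag name
        ->each;                              # return a plain list
}

say "found $_" for kept_tags($html);
```

Keeping the exclusions in a hash means the list can be changed without touching the selector.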

zdim

This answer limits the scope to HTML-like strings. The answer from Wiktor works also on any strings and any brackets or tags. – Codr Dec 12 '22 at 09:32
@Codr "_This answer limits the scope to HTML-like strings. The answer from Wiktor works also on any strings and any brackets or tags_" -- A _lot_ has been written on comparing these approaches. In general, it is strongly recommended to use libraries to parse html / xml (and other particular formats, like json, csv...), for very good reasons. But if one has data which isn't in that format ... well, then the point is moot as one can't use a library for a different format. – zdim Dec 20 '22 at 20:22
@Codr [cont'd] But you clearly do have an html snippet. Which of course _can_ be parsed with regex -- and you asked about regex, so you got an answer for that. So one post answers your direct question, the other one offers another approach instead. No point in comparing them. – zdim Dec 20 '22 at 20:23

score 2 · Answer 2 · answered Dec 09 '22 at 10:06

You can use

while ($html =~ /<(?!(?:b|ul|p)>)(\w+)>/g) { 
  print "found $1\n" 
}

See the regex demo. Details:

< - a < char
(?!(?:b|ul|p)>) - a negative lookahead that fails the match if, immediately to the right of the current location, there is b, ul or p followed by a > char
(\w+) - Capturing group 1: one or more word chars
> - a > char.
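
The same lookahead can be built from a list, so the full set of tags from the question (b, i, ul, ol, li, p) does not have to be typed into the pattern by hand. This is a minimal sketch: the find_tags helper and the sample string are hypothetical, and like the pattern above it only sees simple opening tags without attributes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the names of simple <tag> openers
# that are not in the exclusion list.
sub find_tags {
    my ($html, @exclude) = @_;

    # Build the alternation b|i|ul|... ; quotemeta guards against
    # regex metacharacters in the names.
    my $skip = join '|', map { quotemeta } @exclude;

    my @found;
    # /g is required so each iteration resumes after the previous match
    while ($html =~ /<(?!(?:$skip)>)(\w+)>/g) {
        push @found, $1;
    }
    return @found;
}

# Hypothetical sample input, not from the original post
my $html = '<p>Some <b>bold</b> text in a <div><span>span</span></div></p>';
print "found $_\n" for find_tags($html, qw(b i ul ol li p));
# prints:
# found div
# found span
```

Because the lookahead requires the > right after the name, a tag such as img is still reported even though i is excluded.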

Wiktor Stribiżew

Also possible is `while ($html =~ /<(?!b>|ul>|p>)(\w+)>/g) { print "found $1\n" }` – Codr Dec 09 '22 at 11:16
