1

How can I find all strings between < and > while excluding certain tags such as b, i, ul, ol, li and p? Is there a shorter solution than the following?

while ($html =~ /<(\w+)>/g) { 
  print "found $1\n" if $1 ne 'b' && $1 ne 'ul' && $1 ne 'p' ...
}
brian d foy
Codr

2 Answers

4

You can use a library, and Mojo::DOM makes this easy:

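use feature 'say';
use Mojo::DOM;

my $dom = Mojo::DOM->new($html);

# Each match is a Mojo::DOM element that stringifies to its HTML;
# use $_->tag instead if only the tag name is wanted.
for ( $dom->find(':not(b,i,ul,ol,li,p)')->each ) {
    say;
}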

You now also have the HTML parsed and can process it further as needed.

zdim
  • This answer limits the scope to HTML-like strings. The answer from Wiktor works also on any strings and any brackets or tags. – Codr Dec 12 '22 at 09:32
  • @Codr "_This answer limits the scope to HTML-like strings. The answer from Wiktor works also on any strings and any brackets or tags_" -- A _lot_ has been written on comparing these approaches. In general, it is strongly recommended to use libraries to parse html / xml (and other particular formats, like json, csv...), for very good reasons. But if one has data which isn't in that format ... well, then the point is moot as one can't use a library for a different format. – zdim Dec 20 '22 at 20:22
  • @Codr [cont'd] But you clearly do have an html snippet. Which of course _can_ be parsed with regex -- and you asked about regex, so you got an answer for that. So one post answers your direct question, the other one offers another approach instead. No point in comparing them. – zdim Dec 20 '22 at 20:23
2

You can use

while ($html =~ /<(?!(?:b|ul|p)>)(\w+)>/g) { 
  print "found $1\n" 
}

See the regex demo. Details:

  • < - a < char
  • (?!(?:b|ul|p)>) - a negative lookahead that fails the match if, immediately to the right of the current location, there is b, ul or p followed by a > char
  • (\w+) - Capturing group 1: one or more word chars
  • > - a > char.
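
The pattern above excludes only b, ul and p, mirroring the snippet in the question. As a minimal sketch, the same lookahead could be built from the full list of tags named in the question. The names @skip, $skip_re and other_tags below are hypothetical, introduced only for illustration:

```perl
use strict;
use warnings;

# Tag names to skip, taken from the list in the question.
my @skip = qw(b i ul ol li p);

# Build the alternation once; quotemeta guards against regex metacharacters.
my $skip_re = join '|', map { quotemeta } @skip;

# Hypothetical helper: returns the names of plain <tag> tokens
# whose name is not in @skip.
sub other_tags {
    my ($html) = @_;
    my @found;
    while ( $html =~ /<(?!(?:$skip_re)>)(\w+)>/g ) {
        push @found, $1;
    }
    return @found;
}

print "found $_\n" for other_tags('<p><div><b>bold</b><span><li><pre>');
```

Because the > sits inside the lookahead, a tag such as pre is still reported even though it starts with p. Like the original pattern, this only sees plain <name> tokens, so tags with attributes or closing tags are not matched at all.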
Wiktor Stribiżew