I'm using a regex to parse some HTML. I have the following regex, which matches all tags except img and a:

 \<(?!img|a)[^\>]+\>

This works well, but I also want it to match the closing tags. I've tried the following, but it doesn't work:

 \</?(?!img|a)[^\>]+\>

What would be the best way to do this?

(Also, before there is a plethora of comments saying not to use regexes to parse HTML, I'd just like to say that this HTML is generated by a tool and is very uniform.)

EDIT:

 <p>So in this</p>
 <p>HTML <strong>with nested tags</strong></p>
 <p>It should remove <i>everything</i> except <a href="#">This link</a>
 and this <img src="#" alt="image" /> but it also needs to keep the textual content</p>
JoeS
  • someone saying to not parse html with regex below... – Sirius_Black Aug 03 '14 at 20:43
  • \?(?!img|a)[^\>].*?(?=\>) this would work. it ends with the first match of \? – Val Nolav Aug 03 '14 at 20:46
  • You don't need to escape < or > since they are not special characters. You missed "/" and didn't escape that one... it should be \/? – Sirius_Black Aug 03 '14 at 20:46
  • @JoeNFU can you give an example of what link you don't want to match? – Sirius_Black Aug 03 '14 at 20:51
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – jtbandes Aug 03 '14 at 20:59
  • You're right about the escaping - I've added an example of the HTML – JoeS Aug 03 '14 at 21:00
  • jtbandes - not a duplicate of that question since I'm saying to exclude certain tags. I do like the accepted answer though. – JoeS Aug 03 '14 at 21:04
  • <(?!(img|a)).+>.*?(?=>) try this please – Val Nolav Aug 03 '14 at 21:06
  • Hi Val it doesn't work (in expresso at least) just tweaking it now cos it seems almost there. EDIT: On second thoughts it won't match closing tags. The one above didn't work either. (It matched ' – JoeS Aug 03 '14 at 21:13

2 Answers

OK, here is a pretty wasteful solution:

   <(?!img|a|\/img|\/a)[^>]+>

It would be great if someone could find a better one.
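
For what it's worth, the four alternatives can probably be folded into a single lookahead by making the slash optional inside it. The sketch below is JavaScript. The `\b` word boundary is an addition that is not in the pattern above (it keeps tags such as `<abbr>` from being excluded just because they start with `a`), and `otherTags`, `stripTags` and `sample` are illustrative names.

```javascript
// Sketch: one lookahead covers both <a ...>/<img ...> and </a>/</img>.
// The \b is an assumption added here so that only the exact tag names are excluded.
const otherTags = /<(?!\/?(?:img|a)\b)[^>]+>/g;

// Hypothetical helper: remove every tag except <a>, <img> and their closing tags.
function stripTags(html) {
  return html.replace(otherTags, "");
}

// Shortened version of the sample HTML from the question.
const sample =
  '<p>It should remove <i>everything</i> except <a href="#">This link</a>\n' +
  ' and this <img src="#" alt="image" /> but keep the text</p>';

console.log(stripTags(sample));
```

On the sample this should leave the link, the image and the text in place and drop the `<p>` and `<i>` markup.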

JoeS

I think that the simplest solution would be the following:

<\/?(?!img|a)[^>]+>

It simply matches:

  • a <,
  • a / (escaped with \) if there is any (quantifier ?),
  • asserts that there is neither img nor a,
  • a sequence of anything but > ([^>]+) and
  • a >

See it working here on regex101.
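
One thing that may be worth checking is the closing tags of the excluded elements. When the optional `/` sits outside the lookahead, a backtracking engine is free to let `\/?` match nothing, test the lookahead against the `/` itself, and then consume `/a` with `[^>]+`. The JavaScript sketch below compares that placement with one where the slash is inside the lookahead. The names `slashOutside`, `slashInside` and `tags` are illustrative only.

```javascript
// Optional slash placed before the lookahead.
const slashOutside = /<\/?(?!img|a)[^>]+>/;
// Optional slash placed inside the lookahead (a variant, shown for comparison).
const slashInside = /<(?!\/?(?:img|a))[^>]+>/;

const tags = ["<p>", "</p>", '<a href="#">', "</a>", '<img src="#" />'];

for (const tag of tags) {
  // Log whether each pattern matches the tag.
  console.log(tag, slashOutside.test(tag), slashInside.test(tag));
}
```

If the first pattern reports `true` for `</a>`, that could explain a tool still highlighting the closing tag of a link.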

ccjmne
  • Hmm, OK, that works. I've put it into my JavaScript and it works too. It just doesn't work in Expresso - must be a bug. Thanks. – JoeS Aug 03 '14 at 22:46