Negative lookahead but with something before it

Question

I'm using a regex to parse some HTML. I have the following regex, which matches all tags except img and a.

 \<(?!img|a)[^\>]+\>

This works well, but I also want it to match the closing tags. I've tried the following, but it doesn't work:

 \</?(?!img|a)[^\>]+\>

What would be the best way to do this?

(Also, before there is a plethora of comments saying not to use regexes to parse HTML, I'd just like to say that this HTML is generated by a tool and is very uniform.)

EDIT:

 <p>So in this</p>
 <p>HTML <strong>with nested tags</strong></p>
 <p>It should remove <i>everything</i> except <a href="#">This link</a>
 and this <img src="#" alt="image" /> but it also needs to keep the textual content</p>

\?(?!img|a)[^\>].*?(?=\>) this would work. it ends with the first match of \? — Val Nolav, Aug 03 '14 at 20:46
You don't need to escape < or > since they are not special characters. You missed the "/" and didn't escape it; it should be \/? — Sirius_Black, Aug 03 '14 at 20:46
@JoeNFU can you give an example of what link you don't want to match? — Sirius_Black, Aug 03 '14 at 20:51
possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — jtbandes, Aug 03 '14 at 20:59
You're right about the escaping. I've added an example of the HTML. — JoeS, Aug 03 '14 at 21:00
jtbandes - not a duplicate of that question since I'm saying to exclude certain tags. I do like the accepted answer though. — JoeS, Aug 03 '14 at 21:04
Hi Val, it doesn't work (in Expresso at least). Just tweaking it now because it seems almost there. EDIT: On second thoughts, it won't match closing tags. The one above didn't work either. (It matched ' — JoeS, Aug 03 '14 at 21:13

score 0 · Answer 1 · answered Aug 03 '14 at 21:28

OK, here is a pretty wasteful solution:

   <(?!img|a|\/img|\/a)[^>]+>

It would be great if someone could find a better one.
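
For reference, here is a minimal Python sketch of this pattern applied to a one-line sample modeled on the question's HTML. The variable names and the sample string are assumptions made for illustration; only the pattern comes from this answer.

```python
import re

# Pattern from this answer: the lookahead lists both the opening and
# the closing forms of the tags that should be kept.
pattern = re.compile(r"<(?!img|a|\/img|\/a)[^>]+>")

# Hypothetical sample, modeled on the HTML in the question.
html = '<p>Keep <a href="#">this link</a> and <img src="#" alt="image" /> but drop <i>this</i></p>'

# Remove every matched tag; the text between tags is left alone.
stripped = pattern.sub("", html)
print(stripped)
# Keep <a href="#">this link</a> and <img src="#" alt="image" /> but drop this
```

Because the closing forms are spelled out inside the lookahead, `</a>` is rejected at the `<` itself, so there is no optional slash for the engine to backtrack over.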

JoeS


score 0 · Accepted Answer · answered Aug 03 '14 at 22:26

I think that the simplest solution would be the following:

   <\/?(?!img|a)[^>]+>

It simply matches:

a <,
an optional / (escaped with \ and made optional by the ? quantifier),
a negative lookahead, (?!img|a), asserting that neither img nor a follows at that position,
a sequence of anything but > ([^>]+) and
a >

See it working here on regex101.
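
One caveat worth checking before relying on this pattern: because the slash is optional, a backtracking engine can retry the lookahead with the slash left unconsumed. For `</a>`, the lookahead then sees `/a>` rather than `a>`, passes, and the closing tag is matched after all. This would be consistent with the Expresso result reported in the comments. The Python sketch below illustrates this and compares a variant, not taken from this answer, that places the lookahead directly after `<` and lets it cover the optional slash. The names `slash_first` and `lookahead_first`, the sample string, and the added `\b` (so that tags such as `<abbr>` are not excluded merely for starting with "a") are assumptions for illustration.

```python
import re

# Pattern from the answer: optional slash first, then the lookahead.
slash_first = re.compile(r"<\/?(?!img|a)[^>]+>")

# Hypothetical variant: the lookahead sits right after "<" and includes
# the optional slash, so there is nothing to backtrack over.
# \b keeps names that merely start with "a" (e.g. abbr) matchable.
lookahead_first = re.compile(r"<(?!\/?(?:img|a)\b)[^>]+>")

# Hypothetical sample string.
html = '<p>Keep <a href="#">this link</a> and <abbr>HTML</abbr></p>'

print(slash_first.findall(html))
# ['<p>', '</a>', '</abbr>', '</p>']

print(lookahead_first.findall(html))
# ['<p>', '<abbr>', '</abbr>', '</p>']
```

With the first pattern, `</a>` is matched (and would be removed) while `<a href="#">` is kept, leaving an unbalanced link. The variant leaves both halves of the link alone.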

ccjmne


Hmm, OK, that works. I've put it into my JavaScript and it works too. It just doesn't work in Expresso - must be a bug. Thanks. – JoeS Aug 03 '14 at 22:46
