Single Regex to strip all HTML but the anchors

Question

Versions of this have been asked several times on here, and using those I was able to get two different regex statements.

One that strips all HTML

1. <[^>]*>

And one that strips everything but the anchor tags

2. <a[^>]*>([^<]+)<\/a>

I have no hope of combining those to get a regex that strips all HTML but keeps the anchors, so (1+!2). Therefore I'm currently going through my HTML once with the first regex, and if I encounter a certain keyword that usually lives inside the anchors, then I go through the body with the 2nd regex and combine both.

That clearly is not ideal and will most likely miss many anchors.

What would a single regex that matches all HTML but the anchors look like? /1?!2/

Test data: https://www.regextester.com/?fam=105725 I need everything that is ALL CAPS and the anchor around it.

I do not see any question mark? There are expressions that might be doing that but please provide your tool/programming language as well. — Jan, Nov 01 '18 at 09:37
See https://regex101.com/r/ISrz6O/1 for engines that support `(*SKIP)(*FAIL)` but be warned that it is error-prone with nested structures (such as `HTML` that is). — Jan, Nov 01 '18 at 09:47
@Jan my bad assuming Regex is the same everywhere. This one fails for me with a "Quantifier {x,y} following nothing." but thanks will look into it. — Иво Недев, Nov 01 '18 at 09:52
@quant It matches only the p tags and definitely not the anchor. Added test data to help visualise the problem in my question. — Иво Недев, Nov 01 '18 at 10:08
Surprising that no one has yelled "Don't parse HTML with RegEx!" ;) [This question almost always comes up](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — SamWhan, Nov 01 '18 at 10:15
@SamWhan I'd love to use a 3rd party tool or go about it in a different way, but due to constraints I have to settle for using regex. Luckily my HTML isn't going to vary much, and even in its current state the code works; I'm just trying to improve it. — Иво Недев, Nov 01 '18 at 10:27

score 3 · Accepted Answer · answered Nov 01 '18 at 10:26

Disregarding my own comment ;) - Is this what you're after?

Replace

<((?!a|\/a)[^>]*)>\s*

with empty string.

The negative look-ahead after the opening < makes sure it ignores anchors.

Here at regex101.
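
As a rough illustration, here is how that replacement might look in Python. The question never names a regex engine, so Python's `re` module is an assumption, and the helper name `strip_non_anchor_tags` and the sample markup are made up for the example.

```python
import re

# Pattern from the answer: a tag is matched only if the text right after "<"
# does not start with "a" or "/a". The trailing \s* also eats whitespace
# that follows a removed tag.
NON_ANCHOR_TAG = re.compile(r"<((?!a|\/a)[^>]*)>\s*")


def strip_non_anchor_tags(html: str) -> str:
    """Hypothetical helper: replace every matched tag with an empty string."""
    return NON_ANCHOR_TAG.sub("", html)


# Invented sample input, not the test data linked in the question.
sample = '<div><p>Hello <a href="https://example.com">LINK</a> world</p></div>'
print(strip_non_anchor_tags(sample))
```

The `div` and `p` tags are removed while the anchor and its text are left in place. Note that the look-ahead only checks the first character of the tag name, so any tag whose name begins with `a` is left untouched as well.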

SamWhan

Yep. If it works on the test data, which it did, it is what I'm after. Thank you. – Иво Недев Nov 01 '18 at 10:28
Certainly <((?!a|\/a)[^>]*)>\s* is a good start, so +1, however it is more tricky in detail. If you have a text like asdfgoog the or any other tag starting with " – quant Nov 01 '18 at 10:30
@quant Apart from span, head, meta and the usual suspects none of my html tags start with a, let alone contain a. So for my particular case it works perfectly. – Иво Недев Nov 01 '18 at 10:32
@quant It could be enhanced to not match `alt` and likes simply by adding a word boundary - `<((?!a\b|\/a\b)[^>]*)>\s*` – SamWhan Nov 01 '18 at 10:59
Sam you're a hero ... (but wait a minute, wasn't there already a hero named sam? https://en.wikipedia.org/wiki/Samy_%28computer_worm%29 ) – quant Nov 01 '18 at 11:14
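
To see what the word-boundary variant from the comments changes, the two patterns can be compared side by side. This is only a sketch: Python's `re` module is an assumption (the asker did not say which engine they use), and the sample markup is invented.

```python
import re

# Pattern from the answer: skips every tag whose name merely starts with "a".
ORIGINAL = re.compile(r"<((?!a|\/a)[^>]*)>\s*")
# Variant from the comments: \b requires the tag name to be exactly "a".
WITH_BOUNDARY = re.compile(r"<((?!a\b|\/a\b)[^>]*)>\s*")

# Invented sample containing tags that start with "a" but are not anchors.
sample = '<article><abbr title="x">ABBR</abbr> <a href="#">LINK</a></article>'

kept_by_original = ORIGINAL.sub("", sample)
kept_by_boundary = WITH_BOUNDARY.sub("", sample)

print(kept_by_original)
print(kept_by_boundary)
```

With the original pattern the sample comes back unchanged, because `article` and `abbr` both start with `a`. With the `\b` variant only the real anchor survives. Neither pattern copes with a `>` inside an attribute value, so both rely on the HTML being as regular as the asker describes.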
