Versions of this have been asked several times on here, and using those I was able to get two different regex statements.

One that strips all HTML

1. <[^>]*>

And one that strips everything but the anchor tags

2. <a[^>]*>([^<]+)<\/a>

I have no hope of combining those to get a regex that strips all HTML but keeps the anchors, so (1+!2). Therefore I'm currently going through my HTML once with the first regex, and if I encounter a certain keyword that usually lives inside the anchors, I go through the body with the 2nd regex and combine both.

That clearly is not ideal and will most likely miss many anchors.

What would a single regex that matches all HTML but the anchors look like? /1?!2/

Test data: https://www.regextester.com/?fam=105725 I need everything that is ALL CAPS and the anchor around it.

Иво Недев
  • I do not see any question mark? There are expressions that might be doing that but please provide your tool/programming language as well. – Jan Nov 01 '18 at 09:37
  • See https://regex101.com/r/ISrz6O/1 for engines that support `(*SKIP)(*FAIL)` but be warned that it is error-prone with nested structures (such as `HTML` that is). – Jan Nov 01 '18 at 09:47
  • @Jan my bad assuming Regex is the same everywhere. This one fails for me with a "Quantifier {x,y} following nothing." but thanks will look into it. – Иво Недев Nov 01 '18 at 09:52
  • What about <[^a][^>]*>([^<]+)<\/[^a]> ? – quant Nov 01 '18 at 10:03
  • @quant It matches only the p tags and definitely not the anchor. Added test data to help visualise the problem in my question. – Иво Недев Nov 01 '18 at 10:08
  • Surprising that no one has yelled "Don't parse HTML with RegEx!" ;) [This question almost always comes up](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – SamWhan Nov 01 '18 at 10:15
  • @SamWhan I'd love to use a 3rd party tool or go about it in a different way, but due to constraints I have to settle for using regex. Luckily my HTML isn't going to vary much, and even in its current state the code works; I'm just trying to improve it. – Иво Недев Nov 01 '18 at 10:27

1 Answer

Disregarding my own comment ;) - Is this what you're after?

Replace

<((?!a|\/a)[^>]*)>\s*

with empty string.

The negative look-ahead after the opening < makes sure it ignores anchors.

Here at regex101.
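
For readers who want to try the replacement outside regex101, here is a minimal sketch in Python. The thread never states which language or engine is in use, so Python's `re` module is an assumption, and the `sample` string and the `strip_non_anchor_tags` helper are made up for illustration (they are not the linked test data).

```python
import re

# Pattern from the answer: a "<" that is NOT followed by "a" or "/a",
# then everything up to the closing ">", plus any trailing whitespace.
NON_ANCHOR_TAG = re.compile(r"<((?!a|\/a)[^>]*)>\s*")

def strip_non_anchor_tags(html: str) -> str:
    """Replace every matched (non-anchor) tag with the empty string."""
    return NON_ANCHOR_TAG.sub("", html)

# Hypothetical sample input, not the test data linked in the question.
sample = '<div><p>Intro <a href="/x">KEEP ME</a></p><span>tail</span></div>'

print(strip_non_anchor_tags(sample))
# prints: Intro <a href="/x">KEEP ME</a>tail
```

Because the look-ahead only inspects the first character or two after `<`, this version also leaves alone any other tag whose name begins with `a`, which is the caveat raised in the comments.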

SamWhan
  • Yep. If it works on the test data, which it did, it is what I'm after. Thank you. – Иво Недев Nov 01 '18 at 10:28
  • Certainly <((?!a|\/a)[^>]*)>\s* is a good start, so +1, however it is more tricky in detail. If you have a text like asdfgoog the or any other tag starting with " – quant Nov 01 '18 at 10:30
  • @quant Apart from span, head, meta and the usual suspects none of my html tags start with a, let alone contain a. So for my particular case it works perfectly. – Иво Недев Nov 01 '18 at 10:32
  • @quant It could be enhanced to not match `alt` and likes simply by adding a word boundary - `<((?!a\b|\/a\b)[^>]*)>\s*` – SamWhan Nov 01 '18 at 10:59
  • Sam you're a hero ... (but wait a minute, wasn't there already a hero named sam? https://en.wikipedia.org/wiki/Samy_%28computer_worm%29 ) – quant Nov 01 '18 at 11:14
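
The difference between the original pattern and the word-boundary variant from the comments can be sketched as follows. This again assumes Python's `re` module, and the names `LOOSE`, `STRICT` and the `sample` string are hypothetical, chosen only to include tags such as `article` and `abbr` whose names start with "a".

```python
import re

# Original pattern: skips ANY tag whose name starts with "a" or "/a".
LOOSE = re.compile(r"<((?!a|\/a)[^>]*)>\s*")

# Variant from the comments: \b requires the "a" to be the whole tag name,
# so only <a ...> and </a> are skipped.
STRICT = re.compile(r"<((?!a\b|\/a\b)[^>]*)>\s*")

# Hypothetical input where every tag name begins with "a".
sample = '<article><abbr>x</abbr> <a href="#">LINK</a></article>'

loose_result = LOOSE.sub("", sample)
strict_result = STRICT.sub("", sample)

print(loose_result)   # unchanged: no tag is matched by the loose pattern
print(strict_result)  # prints: x<a href="#">LINK</a>
```

Note that the trailing `\s*` also swallows the space after `</abbr>`, so whitespace following a removed tag disappears along with it.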