I need a regex pattern for the following requirement.

Sample string :

<html> a test of  strength and <h1> valour </h1> for <<<NOT>>> faint hearted <b> BUT </b> protoganist having their characters <<<CARVED>>> out of gibralter <b> ROCK </b>

The above is a single string from which I want to strip every HTML tag while retaining markers of the form <<<xyz>>>.

My attempt:

(^|\n| )<[^>]*>(\n| |$)

Can someone please critically review this?

user692942
IrateINWIT
  • Have you considered using an HTML parser? It would make this task trivial. If you have a good reason not to, please tell us which environment (language, tool) you're running your regex in, as possible solutions will depend on that – Aaron Jan 21 '20 at 14:21
  • No, I have not considered it, as I am unaware of it. My language is VBScript/VBA. Are there any references for an HTML parser in VBScript? – IrateINWIT Jan 21 '20 at 14:38
  • I'm not familiar with VBScript, but [this question](https://stackoverflow.com/questions/16629228/extract-text-between-html-tags) seems to address that topic well – Aaron Jan 21 '20 at 14:40
  • I'm voting to close this question as off-topic because it belongs on [Code Review](https://codereview.stackexchange.com/). – user692942 Jan 21 '20 at 14:56
  • Obligatory [parsing HTML with RegEx warning](https://stackoverflow.com/a/1732454/1014587). – Mast Jan 21 '20 at 15:05

1 Answer

This is what I've come up with. It uses a negative lookahead to make sure an HTML tag is matched only when its closing > is not followed by another >, without including that following character in the match. The tag name must also start with a letter directly after < (or </), so the extra < characters in <<<NOT>>> never begin a match. Note that it only handles simple tags without attributes. Is this what you are after, or did I misread you?

<\/?[A-Za-z][A-Za-z0-9]*>(?!>)
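
A quick way to check a pattern like this against the sample string is a short script. The sketch below uses Python's `re` module as a stand-in for the VBScript `RegExp` object, which is an assumption on my part since the two engines differ in places. The pattern in it (a simple tag followed by a negative lookahead `(?!>)`) is one possible variant, and the `strip_tags` helper name is hypothetical.

```python
import re

# Assumed variant of the tag pattern: "<", an optional "/", a tag name that
# starts with a letter, then ">" that is NOT followed by another ">".
TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*>(?!>)")


def strip_tags(text):
    """Remove simple HTML tags (no attributes) but keep <<<xyz>>> markers."""
    without_tags = TAG.sub("", text)
    # Removing a tag leaves its surrounding spaces behind, so collapse
    # runs of spaces and trim the ends.
    return re.sub(r" {2,}", " ", without_tags).strip()


sample = (
    "<html> a test of  strength and <h1> valour </h1> for <<<NOT>>> "
    "faint hearted <b> BUT </b> protoganist having their characters "
    "<<<CARVED>>> out of gibralter <b> ROCK </b>"
)

print(strip_tags(sample))
```

The space-collapsing step also normalises double spaces that were already in the input, which may or may not be wanted. Tags carrying attributes, such as `<a href="...">`, are not matched by this pattern; an HTML parser is the safer choice once the input gets that complex.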
Matt Cremeens