How can I strip all HTML codes except the <sub> tag?

Question

I need to remove all HTML tags except in these cases:

the tag is a <sub> tag
the tag is preceded by one or more newlines followed by four or more spaces
the tag is surrounded by "`" characters.

Here is an example:

var str = "something1
           <sub>
             something2
             <div class='myclass'>something3</div>
           </sub>
           <div class='myclass'>something4</div>
           something5

               <div class='myclass'>something6</div>
           <div class='myclass'>something7</div>
           `<div>something8</div>`
           something9";

Expected output:

/*   
something1
<sub>
  something2
  something3
</sub>
something4
something5

    <div class='myclass'>something6</div>
`<div>something8</div>`
something9
*/

Here is what I've tried so far:

/\n\s{0,3}<.*[^>]+|<sub>.*?<\/sub>|`.*?`/gm

before using regex on HTML, read http://stackoverflow.com/a/1732454/5053002 — Jaromanda X, Sep 04 '16 at 03:08
@JaromandaX Yes, agreed .. A HTML parser usually would be much better than regex for working on HTML. But I guess in this case regex is better. — Martin AJ, Sep 04 '16 at 03:13
Your `str` variable assignment isn't valid JS - is it just formatted like that to make it easier for us to read? — nnnnnn, Sep 04 '16 at 03:19
@nnnnnn yes exactly .. actually that's the value of a textarea. — Martin AJ, Sep 04 '16 at 03:20
Are you using JS in the backend? If yes I know npm validator can strip all HTML tags. I am not sure if you can allow certain tags. But if not, you can definitely create a new module and add your own method (pre-processing). I heard of npm striptags and npm string. They may allow certain tags. Read their doc. — user3207158, Sep 04 '16 at 03:52
@user3207158 ok, just as a note, I guess JS is front-end *(not back-end)* .. and yes I use it. thank you anyway. — Martin AJ, Sep 04 '16 at 03:58
@MartinAJ Not always..you can write JS code with nodejs in the backend — user3207158, Sep 04 '16 at 04:03
I don't think there's a *one regex to solve it all* here. You will probably have to split the thing by \n and then run each line, finding sub and what not. — A. L, Sep 04 '16 at 06:17

score 0 · Answer 1 · answered Sep 07 '16 at 15:35

This is possible with a regex substitution. Use this regex with the g flag (the m flag is harmless here, since the pattern has no ^ or $ anchors):

(\n\n    .*|`[^`]+`|<\/?sub\b[^>]*>)|<[^>]+>

And use $1 as the replacement.

There are several parts to this. The capturing group matches all the HTML you want to keep:

\n\n    .* An empty line, followed by a line that starts with 4 spaces.
`[^`]+` Anything between backticks.
<\/?sub\b[^>]*> Opening or closing sub tags. [^>]* allows zero attribute characters, so a bare <sub> matches too.

Any remaining HTML tag matches <[^>]+> instead. The capturing group takes no part in that match, so $1 is empty and the tag is discarded.
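As a rough sketch of how this substitution could be applied in JavaScript: the function name stripTags and the sample string are made up for illustration, and the sample assumes the textarea value has no leading indentation on ordinary lines. The sub alternative here uses [^>]* so that a bare <sub> with no attributes is kept.

```javascript
// Hypothetical helper: keeps <sub> tags, backtick-quoted spans, and lines
// that follow an empty line and start with 4 spaces; strips other tags.
function stripTags(input) {
  const pattern = /(\n\n    .*|`[^`]+`|<\/?sub\b[^>]*>)|<[^>]+>/gm;
  // "$1" re-inserts the capturing group; when the second alternative
  // matched, the group is unset and the match is replaced by nothing.
  return input.replace(pattern, "$1");
}

// Assumed sample input, modelled on the question's example.
const sample = [
  "something1",
  "<sub>",
  "  something2",
  "  <div class='myclass'>something3</div>",
  "</sub>",
  "<div class='myclass'>something4</div>",
  "something5",
  "",
  "    <div class='myclass'>something6</div>",
  "<div class='myclass'>something7</div>",
  "`<div>something8</div>`",
  "something9",
].join("\n");

const stripped = stripTags(sample);
console.log(stripped);
```

On this input the div tags around something3, something4 and something7 are removed, while the sub tags, the indented something6 line and the backtick-quoted something8 line are left intact. Note that the text something7 itself stays in the result; only its tags are stripped.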
