
I need to remove all HTML tags, except when:

  • it is a <sub> tag
  • it is preceded by one or more newlines followed by four or more spaces
  • it is surrounded by "`" characters.

Here is an example:

var str = "something1
           <sub>
             something2
             <div class='myclass'>something3</div>
           </sub>
           <div class='myclass'>something4</div>
           something5

               <div class='myclass'>something6</div>
           <div class='myclass'>something7</div>
           `<div>something8</div>`
           something9";

Expected output:

something1
<sub>
  something2
  something3
</sub>
something4
something5

    <div class='myclass'>something6</div>
`<div>something8</div>`
something9

Here is what I've tried so far:

/\n\s{0,3}<.*[^>]+|<sub>.*?<\/sub>|`.*?`/gm
Martin AJ
  • Before using regex on HTML, read http://stackoverflow.com/a/1732454/5053002 – Jaromanda X Sep 04 '16 at 03:08
  • @JaromandaX Yes, agreed. An HTML parser is usually much better than regex for working on HTML, but I guess in this case regex is better. – Martin AJ Sep 04 '16 at 03:13
  • Your `str` variable assignment isn't valid JS - is it just formatted like that to make it easier for us to read? – nnnnnn Sep 04 '16 at 03:19
  • @nnnnnn Yes, exactly. It's actually the value of a textarea. – Martin AJ Sep 04 '16 at 03:20
  • Are you using JS in the backend? If yes I know npm validator can strip all HTML tags. I am not sure if you can allow certain tags. But if not, you can definitely create a new module and add your own method (pre-processing). I heard of npm striptags and npm string. They may allow certain tags. Read their doc. – user3207158 Sep 04 '16 at 03:52
  • @user3207158 OK. Just as a note, I think of JS as front-end *(not back-end)*, and yes, I use it. Thank you anyway. – Martin AJ Sep 04 '16 at 03:58
  • @MartinAJ Not always. You can write JS code with Node.js in the backend – user3207158 Sep 04 '16 at 04:03
  • I don't think there's a *one regex to solve it all* here. You will probably have to split the thing by \n and then run each line, finding sub and what not. – A. L Sep 04 '16 at 06:17

1 Answer


This is possible with a regex substitution. Use this regex with the g modifier (m is harmless here, since the pattern has no ^ or $ anchors):

(\n\n    .*|`[^`]+`|<\/?sub\b[^>]*>)|<[^>]+>

And use $1 as the substitution.

There are several parts to this. The capturing group matches everything you want to keep:

  • \n\n    .* An empty line, followed by a line that starts with 4 spaces.
  • `[^`]+` Anything in backticks.
  • <\/?sub\b[^>]*> A sub tag, opening or closing. [^>]* allows attributes but does not require them, so a bare <sub> matches too.

Any other HTML tag matches <[^>]+> instead. Because that alternative sits outside the capturing group, $1 is empty for it and the tag is removed.
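
As a rough illustration, the substitution could be applied in JavaScript along these lines. This is only a sketch: the names stripTags, KEEP_OR_STRIP and sample are made up for the example, the input is the question's sample with its display indentation removed, and the sub branch uses [^>]* (an assumption on my part) so that a bare <sub> with no attributes is kept.

```javascript
// Sketch only: stripTags, KEEP_OR_STRIP and sample are hypothetical names.
// Group 1 captures what should be kept; the last alternative matches any other tag.
const KEEP_OR_STRIP = /(\n\n    .*|`[^`]+`|<\/?sub\b[^>]*>)|<[^>]+>/g;

function stripTags(text) {
  // "$1" is the kept text when group 1 matched, and an empty string otherwise,
  // so every tag matched by the last alternative disappears.
  return text.replace(KEEP_OR_STRIP, "$1");
}

// The question's sample input, without the indentation added for readability.
const sample = [
  "something1",
  "<sub>",
  "  something2",
  "  <div class='myclass'>something3</div>",
  "</sub>",
  "<div class='myclass'>something4</div>",
  "something5",
  "",
  "    <div class='myclass'>something6</div>",
  "<div class='myclass'>something7</div>",
  "`<div>something8</div>`",
  "something9",
].join("\n");

console.log(stripTags(sample));
```

Note that the \n\n branch only covers exactly one empty line followed by four spaces. If the tag can sit after several empty lines, or after text on the same indented line, the first alternative would need adjusting.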

Laurel