Regex to strip anything that isn't an html comment

Question

I know using a regex to parse html is normally a non-starter but I don't want anything that clever...

Taking this example

<div><!--<b>Test</b>-->Test</div>
<div><!--<b>Test2</b>-->Test2</div>

I'd like to strip out ANYTHING that isn't between <!-- and --> to get:

<b>Test</b><b>Test2</b>

Tags are guaranteed to be correctly matched (no unclosed/nested comments).

What regex do I need to use?

[You shouldn't try to parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Bohemian, Jan 16 '12 at 10:28
@Bohemian I've read it, hence my first sentence, but I'm dealing with a very specific circumstance: find all text which has a `<!--` before it and a `-->` after it. It's not really parsing at all but string matching. If the question had been "Find everything enclosed in brackets", nobody would bat an eyelid... — Basic, Jan 16 '12 at 10:32
@Gordon - a nice alternative. I gave it a try and it worked perfectly. — Basic, Jan 16 '12 at 12:58

Bart Kiers · Accepted Answer · 2012-01-16T11:01:28.310

Replace the pattern:

(?s)((?!-->).)*<!--|-->((?!<!--).)*

with an empty string.

A short explanation:

(?s)              # enable DOT-ALL
((?!-->).)*<!--   # match anything except '-->' ending with '<!--'
|                 # OR
-->((?!<!--).)*   # match '-->' followed by anything except '<!--'

Be careful when processing (X)HTML with regex. Whenever parts of comments occur in tag-attributes or CDATA blocks, things go wrong.
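
To make that caveat concrete, here is a minimal JavaScript sketch in which a `-->` sits inside an attribute value. The input string and the helper name `stripOutsideComments` are made up for illustration. The pattern treats the `-->` in the attribute as the end of a comment, so part of the tag leaks into the result:

```javascript
// Hypothetical helper wrapping the replacement described in this answer.
function stripOutsideComments(html) {
  return html.replace(/((?!-->)[\s\S])*<!--|-->((?!<!--)[\s\S])*/g, "");
}

// Illustrative input: the title attribute contains "-->".
const tricky = '<a title="-->">x</a><!--<b>Test</b>-->y';

console.log(stripOutsideComments(tricky));
// prints: <a title="<b>Test</b>
```

The text before the stray `-->` is never followed by a `<!--` it can be paired with, so it is left in place instead of being stripped.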

EDIT

Seeing your most active tag is JavaScript, here's a JS demo:

print(
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>"
  .replace(
    /((?!-->)[\s\S])*<!--|-->((?!<!--)[\s\S])*/g,
    ""
  )
);

which prints:

<b>Test</b><b>Test2</b>

Note that since JS does not support the (?s) flag, I used the equivalent [\s\S] which matches any character (including line break chars).

Test it on Ideone here: http://ideone.com/6yQaK
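
As a small illustration of the `[\s\S]` note, the sketch below (the sample string is made up) shows that `.` stops at a line break while `[\s\S]` does not. It also uses the `s` (dotAll) regex flag, which was added to JavaScript in ES2018, after this answer was written, so that last line assumes a reasonably modern engine:

```javascript
// Illustrative sample: two characters separated by a line break.
const sample = "a\nb";

// "." does not match a line break by default.
console.log(/a.b/.test(sample));       // false

// [\s\S] matches any character, including line breaks.
console.log(/a[\s\S]b/.test(sample));  // true

// With the ES2018 "s" flag, "." matches line breaks as well.
console.log(/a.b/s.test(sample));      // true
```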

EDIT II

And a PHP demo would look like:

<?php
$s = "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
echo preg_replace('/(?s)((?!-->).)*<!--|-->((?!<!--).)*/', '', $s);
?>

which also prints:

<b>Test</b><b>Test2</b>

as can be seen on Ideone: http://ideone.com/Bm2uJ

@Basiclife, you're welcome. Also see the EDIT with a small demo. — Bart Kiers, Jan 16 '12 at 10:49
Nicely spotted on the tags but this is actually for use in PHP - `preg_replace("/(?s)((?!-->).)*((?! — Basic, Jan 16 '12 at 10:52
Thanks Bart, that worked - not sure why mine didn't but I'll spend some time digging when I can — Basic, Jan 16 '12 at 12:58

score 1 · Answer 2 · edited Jan 21 '12 at 03:55

s/-->.*?<!--//g strips everything between a "-->" and the next "<!--"

s/^.*?<!--// strips from the beginning to the first occurrence of "<!--"

s/-->.*?$// strips from the last remaining "-->" to the end

.* matches any number of characters, while .*? matches as few characters as possible for the whole pattern to still match. Note that . does not match line breaks by default, so multi-line input needs dot-all mode (the s modifier in Perl-style syntax).

^ stands for the beginning of the string and $ for the end
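
The three substitutions can be sketched in JavaScript as follows. This assumes the opening delimiter is meant to be `<!--`, and it uses `[\s\S]` in place of `.` so the patterns also work across line breaks. The function name `keepCommentBodies` is hypothetical:

```javascript
// Hypothetical helper: applies the three substitutions in order.
function keepCommentBodies(html) {
  return html
    // 1. Drop everything between the end of one comment and the start of the next.
    .replace(/-->[\s\S]*?<!--/g, "")
    // 2. Drop everything from the beginning up to the first "<!--".
    .replace(/^[\s\S]*?<!--/, "")
    // 3. Drop everything from the one remaining "-->" to the end.
    .replace(/-->[\s\S]*?$/, "");
}

const example = "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
console.log(keepCommentBodies(example));
// prints: <b>Test</b><b>Test2</b>
```

The order matters: after the first step only one "-->" is left, which is why the third pattern can safely run from it to the end. Input that contains no comment at all comes back unchanged.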

Hachi · answered Jan 16 '12 at 10:31 · edited by Basic

Yes please - an explanation would be helpful. – Basic Jan 16 '12 at 10:36

score 1 · Answer 3 · answered Jan 16 '12 at 10:40

Another possibility would be this

.*?<!--(.*?)-->.*?(?=<!--|$)

and replace with

$1

See it here on Regexr

If you read your string row by row, this matches everything up to the first comment, puts the content of that comment into group 1, and then matches everything up to the next comment or the end of the row.
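
A minimal JavaScript sketch of this row-by-row idea, assuming rows are separated by "\n". The function name `commentBodiesPerLine` is hypothetical:

```javascript
// Hypothetical helper: applies the capture-group replacement to each row.
function commentBodiesPerLine(html) {
  // Group 1 holds the comment body; the lookahead stops before the
  // next comment or at the end of the row.
  const pattern = /.*?<!--(.*?)-->.*?(?=<!--|$)/g;
  return html
    .split("\n")
    .map((line) => line.replace(pattern, "$1"))
    .join("");
}

const rows = "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
console.log(commentBodiesPerLine(rows));
// prints: <b>Test</b><b>Test2</b>
```

One thing to watch for: a row that contains no comment does not match the pattern at all, so this sketch would leave that row in the output unchanged.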
