
I know using a regex to parse HTML is normally a non-starter, but I don't want anything that clever...

Taking this example

<div><!--<b>Test</b>-->Test</div>
<div><!--<b>Test2</b>-->Test2</div>

I'd like to strip out ANYTHING that isn't between <!-- and --> to get:

<b>Test</b><b>Test2</b>

Tags are guaranteed to be correctly matched (no unclosed/nested comments).

What regex do I need to use?

Basic
  • [You shouldn't try to parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Bohemian Jan 16 '12 at 10:28
  • @Bohemian I've read it, hence my first sentence, but I'm dealing with a very specific circumstance: find all text which has a `<!--` before it and a `-->` after it. It's not really parsing at all, just string matching. If the question had been "Find everything enclosed in brackets", nobody would bat an eyelid... – Basic Jan 16 '12 at 10:32
  • http://codepad.viper-7.com/eYvaJj – Gordon Jan 16 '12 at 11:08
  • @Gordon - a nice alternative. I gave it a try and it worked perfectly. – Basic Jan 16 '12 at 12:58

3 Answers


Replace the pattern:

(?s)((?!-->).)*<!--|-->((?!<!--).)*

with an empty string.

A short explanation:

(?s)              # enable DOT-ALL
((?!-->).)*<!--   # match anything except '-->' ending with '<!--'
|                 # OR
-->((?!<!--).)*   # match '-->' followed by anything except '<!--'

Be careful when processing (X)HTML with regex. Whenever parts of comments occur in tag-attributes or CDATA blocks, things go wrong.
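To make that caveat concrete, here is a minimal JavaScript sketch that applies the same pattern to a made-up input in which `-->` appears inside an attribute value. The input string and the variable names are hypothetical, chosen only for illustration.

```javascript
// Same pattern as above, with [\s\S] standing in for a dot-all '.'.
const stripPattern = /((?!-->)[\s\S])*<!--|-->((?!<!--)[\s\S])*/g;

// Hypothetical input: the title attribute happens to contain '-->'.
const tricky = '<a title="-->">link</a><!--<b>Test</b>-->tail';

// The regex cannot tell the attribute text from a real comment terminator,
// so the start of the <a> tag survives in the output.
const mangled = tricky.replace(stripPattern, "");
console.log(mangled); // <a title="<b>Test</b>
```

The leftover `<a title="` shows why this approach is only safe when the input is known not to contain comment delimiters outside real comments.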

EDIT

Seeing your most active tag is JavaScript, here's a JS demo:

print(
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>"
  .replace(
    /((?!-->)[\s\S])*<!--|-->((?!<!--)[\s\S])*/g,
    ""
  )
);

which prints:

<b>Test</b><b>Test2</b>

Note that since JS does not support the (?s) flag, I used the equivalent [\s\S] which matches any character (including line break chars).
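For what it's worth, engines implementing ES2018 or later accept an `s` (dotAll) flag on the regex literal, though not the inline `(?s)` form. The following is a small sketch, assuming such an engine (for example a recent Node.js), that checks whether both spellings give the same result on the sample input. The variable names are illustrative only.

```javascript
const sample =
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";

// Portable spelling: [\s\S] matches any character, line breaks included.
const portable = /((?!-->)[\s\S])*<!--|-->((?!<!--)[\s\S])*/g;

// ES2018+ spelling: the 's' flag lets '.' match line breaks as well.
const withDotAll = /((?!-->).)*<!--|-->((?!<!--).)*/gs;

const viaPortable = sample.replace(portable, "");
const viaDotAll = sample.replace(withDotAll, "");

console.log(viaPortable);               // <b>Test</b><b>Test2</b>
console.log(viaPortable === viaDotAll); // true
```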

Test it on Ideone here: http://ideone.com/6yQaK

EDIT II

And a PHP demo would look like:

<?php
$s = "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
echo preg_replace('/(?s)((?!-->).)*<!--|-->((?!<!--).)*/', '', $s);
?>

which also prints:

<b>Test</b><b>Test2</b>

as can be seen on Ideone: http://ideone.com/Bm2uJ

Bart Kiers
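
s/-->.*?<!--//g strips off anything between "-->" and the next "<!--"

s/^.*?<!--// strips off everything from the beginning up to the first occurrence of "<!--"

s/-->.*?$// strips off everything from the last occurrence of "-->" to the end

.* matches any number of characters, and .*? matches the fewest characters possible such that the whole pattern still matches

^ stands for the beginning of the string and $ for the end

Note that in engines where . does not match line breaks by default, multi-line input needs the single-line (dot-all) modifier, otherwise a match cannot span a line break.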
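As a rough illustration, the three substitutions can be chained in JavaScript as follows. This is only a sketch: the function name `keepCommentBodies` is made up, `<!--` is used as the comment opener, and `[\s\S]` replaces `.` so that matches can span line breaks.

```javascript
// Hypothetical helper that applies the three substitutions in order.
function keepCommentBodies(html) {
  return html
    // 1. Drop everything between a '-->' and the next '<!--'.
    .replace(/-->[\s\S]*?<!--/g, "")
    // 2. Drop everything from the start up to the first '<!--'.
    .replace(/^[\s\S]*?<!--/, "")
    // 3. Drop everything from the remaining '-->' to the end.
    .replace(/-->[\s\S]*?$/, "");
}

const sampleHtml =
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
const stripped = keepCommentBodies(sampleHtml);
console.log(stripped); // <b>Test</b><b>Test2</b>
```

The order matters: after step 1 only one `-->` is left, which is why step 3 can match from it to the end. Input that contains no comment at all is returned unchanged by this sketch.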

Basic
Hachi

Another possibility would be this

.*?<!--(.*?)-->.*?(?=<!--|$)

and replace with

$1

See it here on Regexr

If you read your string line by line, this matches everything up to the first comment, captures the content of that comment in group 1, and then matches everything up to the next comment or the end of the line.
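A line-by-line version of this idea might look like the sketch below in JavaScript. The function name `extractComments` is hypothetical, and the results of the individual lines are joined without a separator so that the output matches the one asked for in the question.

```javascript
// Pattern from this answer; without the 'm' flag, '$' is the end of each line string.
const commentRe = /.*?<!--(.*?)-->.*?(?=<!--|$)/g;

// Hypothetical helper: process the input one line at a time.
function extractComments(text) {
  return text
    .split(/\r?\n/)
    // Replace every match with the captured comment body (group 1).
    .map((line) => line.replace(commentRe, "$1"))
    .join("");
}

const twoLines =
  "<div><!--<b>Test</b>-->Test</div>\n<div><!--<b>Test2</b>-->Test2</div>";
const extracted = extractComments(twoLines);
console.log(extracted); // <b>Test</b><b>Test2</b>
```

One thing to keep in mind: a line that contains no comment at all does not match the pattern, so this sketch passes it through unchanged rather than removing it.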

stema