delete html comment tags using regexp

Question

This is what my text (HTML) file looks like:
    <!--
     |                                |
     |  This is a dummy comment       |
     |      please delete me          |
     |         asap                   |
     |                                |
      ________________________________
     | -->

    this is another line 
    in this long dummy html file...
    please do not delete me

I'm trying to delete the comment using sed:

cat file.html | sed 's/.*<!--\(.*\)-->.*//g'

It doesn't work :( What am I doing wrong?

Thank you very much for your help!

The [usual warnings](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) apply. — Dennis Williamson, Oct 30 '10 at 01:59
@Dennis: It's what worked for me with [RegExPal](http://regexpal.com/), I didn't realize that `sed` used a different syntax for regular expressions. — drudge, Nov 01 '10 at 19:16

Brian Clements · Accepted Answer · 2010-10-29T22:11:32.577

17

patrickmdnet has the correct answer. Here it is on one line using extended regex:

cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'

Here is a good resource for learning more about sed. The command above is an adaptation of its one-liner #92:

http://www.catonmat.net/blog/sed-one-liners-explained-part-three/
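For anyone unpacking the one-liner, here is a sketch of the same commands split into one `-e` expression each, with a comment per step. It assumes GNU sed (for `-r`, and because `.` has to match the newline that `N` embeds in the pattern space). The function name `strip_comments` is made up for this example. One caveat worth checking against your own input: sed has no lazy quantifiers, so `.*?` behaves like a greedy `.*`, and two comments on the same line are removed together with whatever sits between them.

```shell
# strip_comments is a hypothetical wrapper; the sed commands are the ones from
# the one-liner above, one per -e. Assumes GNU sed.
#   :a                 define a label named "a"
#   s/<!--.*?-->//g    delete any complete comment in the pattern space
#   /<!--/N            an unclosed "<!--" remains: append the next input line
#   //ba               empty regex = last regex used (/<!--/); if it still
#                      matches, branch back to label "a"
strip_comments() {
    sed -r -e ':a' -e 's/<!--.*?-->//g' -e '/<!--/N' -e '//ba'
}

# Multi-line comment: the lines are joined, then removed as one match.
# An empty line is left where the comment was.
printf 'a\n<!-- one\ntwo -->\nb\n' | strip_comments

# Caveat: ".*?" is not lazy in sed, so the match runs from the first "<!--"
# to the last "-->" and the text between the two comments is lost.
printf 'x <!--1--> y <!--2--> z\n' | strip_comments   # prints "x  z"
```

If several comments can share a line with text you want to keep, an HTML-aware tool is probably the safer choice.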

edited Oct 29 '10 at 22:11

answered Oct 29 '10 at 21:31

Thanks Brian! You're great :) what does the :a mean in your sed command? – Zenet Oct 29 '10 at 21:39
It creates a branch label named 'a'. The '//ba' at the end is branching to 'a'. – Brian Clements Oct 29 '10 at 22:06
Is the `//` before `ba` necessary? I don't need it in GNU `sed`. – Dennis Williamson Oct 30 '10 at 01:49
The double slash is short-hand for the previous expression (which is / – Brian Clements Oct 30 '10 at 02:16

patrickmdnet · Answer 2 · 2010-10-29T21:20:13.290

One problem with your original attempt is that your regex only handles comments that are entirely on one line. Also, the leading and trailing ".*" will remove non-comment text.
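Both problems are easy to reproduce. The sketch below wraps the expression from the question in a function (the name `original_attempt` is only for illustration) and feeds it two small inputs.

```shell
# The expression from the question, unchanged; the function name is hypothetical.
original_attempt() {
    sed 's/.*<!--\(.*\)-->.*//g'
}

# 1. Single-line comment: the leading and trailing ".*" swallow the rest of
#    the line, so only an empty line is printed.
printf 'keep me <!-- gone --> me too\n' | original_attempt

# 2. Multi-line comment: sed works line by line and no single line contains
#    both "<!--" and "-->", so nothing matches and the input is printed as is.
printf '<!--\ncomment\n-->\n' | original_attempt
```

The first call prints an empty line; the second prints its three input lines unchanged.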

You would be better off using existing code instead of rolling your own:

http://sed.sourceforge.net/grabbag/scripts/strip_html_comments.sed

#! /bin/sed -f
# Delete HTML comments
# i.e. everything between <!-- and -->
# by Stewart Ravenhall <stewart.ravenhall@ukonline.co.uk>

/<!--/!b
:a
/-->/!{
    N
    ba
}
s/<!--.*-->//

(from http://sed.sourceforge.net/grabbag/scripts/)
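A minimal sketch of running that script, assuming it is saved to a file and passed to sed with `-f`. The temporary path and the function name `strip_with_script` are arbitrary choices for this example; the script body is copied from above.

```shell
# Write the script to a temporary file (remove it with rm -f "$script" when done).
script=$(mktemp)
cat > "$script" <<'EOF'
/<!--/!b
:a
/-->/!{
    N
    ba
}
s/<!--.*-->//
EOF

# Hypothetical helper: run the saved script over standard input.
strip_with_script() {
    sed -f "$script"
}

# Multi-line comment: the comment lines collapse into one empty line.
printf 'before\n<!--\ncomment\n-->\nafter\n' | strip_with_script

# Inline comment: text on either side of the comment is kept.
printf 'x <!-- c --> y\n' | strip_with_script   # prints "x  y"
```

Note that the final `s` command has no `g` flag and `.*` is greedy, so a line holding two separate comments is treated as a single match from the first `<!--` to the last `-->`.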

See this link for various ways to use Perl modules to remove HTML comments (using Regexp::Common, HTML::Parser, or File::Comments). I am sure there are methods using other utilities as well.

http://www.perlmonks.org/?node_id=500603

eldarerathis · Answer 3 · 2010-10-29T21:23:54.297

3

I think you can do this with awk if you want. Start with this file:

[~] $ more test.txt
<!--

An HTML style comment 

-->

Some other text

<div>
<p>blah</p>
</div>

<!-- Whoops
     Another comment -->
<span>Something</span>

Result of the awk:

[~]$ cat test.txt | awk '/<!--/ {off=1} /-->/ {off=2} /([\s\S]*)/ {if (off==0) print; if (off==2) off=0}'
Some other text

<div>
<p>blah</p>
</div>

<span>Something</span>

edited Oct 29 '10 at 21:23

answered Oct 29 '10 at 21:02

fyi I addressed the concern of @john-jones with a slight change to the awk code, [here](https://stackoverflow.com/a/74072379/1238406). – Barumpus Oct 14 '22 at 16:53

score 0 · Answer 4 · answered Oct 14 '22 at 16:51

Improving (hopefully) on the awk-based answer provided by eldarerathis --

The code below addresses the concern raised by john-jones.

In this version, the text leading up to the start of the HTML comment is preserved, as is the text following the close of the HTML comment.

$ cat some-file | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'

For example:

$ cat test.txt
<!--

An HTML style comment

-->

<meta charset="utf-8"> <!-- charset encoding must be within the first 1024 bytes of the document -->
Some other text

<div>
<p>blah</p>
</div>

<!-- Whoops
     Another comment -->
<span>Something</span>

<div> <!-- start of foo -->
foo
</div> <!-- end of foo -->

<div> <!-- start of multiline comment
bar
end of multiline comment --> </div>

$ cat test.txt | awk '/<!--/ { mode=1; start=index($0,"<!--"); prefix=substr($0,1,start-1); } /-->/ { mode=2; start=index($0, "-->")+3; suffix=substr($0,start); print prefix suffix; prefix=""; suffix=""; } /./ { if (mode==0) print $0; if (mode==2) mode=0; }'

<meta charset="utf-8">
Some other text
<div>
<p>blah</p>
</div>

<span>Something</span>
<div>
foo
</div>
<div>  </div>
