
I know the solution is easy with awk, but for this type of problem I'm often stuck with sed. I've fallen into this trap several times and have not found a solution anywhere yet.

The sample:

<!-- comment #1 --><p>useful text</p>  <!-- comment #2 -->more useful text

How can I eliminate the comments with sed?

Solutions like this one

cat file.html | sed -e :a -re 's/<!--.*?-->//g;/<!--/N;//ba'

(found here) manage multiple lines quite well (so I excluded this part of the problem), but fall into the trap of the regex's "greedy" behavior. None of the solutions I found handles the problem "eliminate two comment blocks in one line".

My idea for a solution looks like this, but it doesn't work:

sed -re 's/<!--[^(-->)]*-->//g' in.html > out.html

But all my efforts to negate the subexpression (-->) have failed.
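
For what it's worth, here is a small sketch of why the attempt above behaves unexpectedly: inside [...] the parentheses and hyphens are just members of a character set, so [^(-->)] excludes single characters rather than the string -->. The sample input is made up for illustration, and LC_ALL=C is only there to pin how the range inside the brackets is interpreted.

```shell
# Hypothetical input: the comment body contains a hyphen.
sample='<!-- non-greedy -->text'

# [^(-->)] is a negated character set (it excludes '-', '>', ')' and more),
# not a negated "-->" group, so the match cannot get past the hyphen.
result=$(printf '%s\n' "$sample" | LC_ALL=C sed -E 's/<!--[^(-->)]*-->//g')

printf '%s\n' "$result"   # the line comes back unchanged
```

So the comment survives as soon as its body contains one of the excluded characters.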

I would appreciate a general solution for this type of issue, but I'm also curious whether there is a way to negate a subexpression in sed (hence the title).


Used version: sed (GNU sed) 4.7

Wiktor Stribiżew
dodrg
  • Sed is not an XML/HTML parser. Use `xmlstarlet` – Gilles Quénot Mar 13 '23 at 18:30
  • _I'm curious if there is a way to negate a subexpression in sed_: no, `sed` does not support negative lookaround (`(?!-->)`). `sed` does not support arbitrarily complex regexes, which is why it is so small and fast. In many cases complex regexes are not really needed, but if you really need the full power of `perl` regex you'll have to use something else, like `perl`. – Renaud Pacalet Mar 14 '23 at 16:29
  • Why did you delete your *whole* post? I think your first solution is functional and useful. It is a good alternative, especially when the subexpression to exclude becomes too complex. – dodrg Mar 14 '23 at 17:12
  • @dodrg Because it was just a variation around potong's idea. There is no need to have several answers with the same idea, it's more confusing than helpful for future readers. – Renaud Pacalet Mar 14 '23 at 17:26
  • @Renaud Pacalet I meant your solution with string replacement, where the subexpression is reduced to one character. This alternative will win when it comes to long subexpressions. At that point, the cost of having to care about the substitution character is outweighed by the benefit. – dodrg Mar 14 '23 at 17:41
  • @dodrg Oh, I see. I restored only that part. You're right, it might be useful in certain circumstances. – Renaud Pacalet Mar 14 '23 at 19:57

4 Answers

2

This might work for you (GNU sed):

sed -E 'H;1h;$!d;x;s/<!--([^>-]+|(-?>+)+|(-+[^->]+))*(-?>+)*--+>//g' file

Slurp the whole file into memory.

The substitution removes, globally throughout the file, every string that begins with <!--, continues with zero or more repetitions of any of three alternatives, and ends with the closing sequence. The alternatives are: one or more characters that are neither > nor -; one or more repetitions of an optional - followed by one or more >'s; or one or more -'s followed by one or more characters that are neither - nor >. The closing sequence is zero or more repetitions of an optional - followed by one or more >'s, then two or more -'s, then >.

N.B. This assumes the file is well formed.
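
The slurp step (H;1h;$!d;x) can be tried in isolation. A minimal sketch with a made-up three-line input, replacing the embedded newlines so the effect is visible:

```shell
# H appends each line to the hold space, 1h starts the hold space with line 1,
# $!d drops every line but the last, and x swaps the collected text back in.
joined=$(printf 'a\nb\nc\n' | sed 'H;1h;$!d;x;s/\n/,/g')

printf '%s\n' "$joined"   # prints: a,b,c
```

After the x command the pattern space holds the whole input, so a single s///g can match across line breaks.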


Kudos to Renaud Pacalet for the most elegant solution:

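sed -E 'H;1h;$!d;x;s/<!--([^>]|[^-]>|[^-]->)*-->//g' file

This assumes the comment body does not start with > or ->, so it does not cover the edge case <!-->-->. Making the leading character optional ([^-]?>) to cover that case would let the group match a lone >, and with it any character, so the expression would become greedy again.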

potong
1

As explained in the answers and comments, using sed to parse HTML is sub-optimal. But as you are stuck with sed, let's try to solve your problem with GNU sed (it may also work with other sed implementations). Of course, we will assume that any <!-- token opens a comment and that the next --> closes it. We will not design a true HTML parser with sed.

If you have delimiters in contexts where they should not be considered as such you'll have to switch to a real HTML parser.

Your main issue is that the comment delimiters are multi-character tokens, and you cannot really negate such things in sed regular expressions. So let's first replace the ending delimiters with a single character (Device Control 4, ASCII code 0x14), then remove the comments, and finally substitute back (just in case you have unbalanced delimiters). If that character can occur in your inputs, choose another one. We will also use the -z option to process your entire input as if it were a single line; this takes care of multi-line comments (as long as you don't have NUL characters in your inputs).

$ str="<!-- cmt1 --> text1 <!-- cmt3
--> text2 <!-- cmt4 --> --> text3"

$ sed -z 's/-->/\x14/g;s/<!--[^\x14]*\x14//g;s/\x14/-->/g' <<< "$str"

 text1  text2  --> text3

The key is the <!--[^\x14]*\x14 regular expression that matches only a single comment.

If you know for sure that your comments are always perfectly balanced (no dangling delimiters) you can remove the last substitute command:

$ str="<!-- cmt1 --> text1 <!-- cmt3
--> text2 <!-- cmt4 -->"

$ sed -z 's/-->/\x14/g;s/<!--[^\x14]*\x14//g' <<< "$str"

 text1  text2
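
If you are unsure whether the placeholder byte can occur in your inputs, you could check for it first. The following is only a sketch: strip_comments is a hypothetical helper name, and GNU sed is assumed for -z and \x14.

```shell
# Hypothetical helper: refuse to run when byte 0x14 is already in the input.
strip_comments() {
  placeholder=$(printf '\024')   # octal 024 is hex 14
  if printf '%s' "$1" | grep -q "$placeholder"; then
    echo "input already contains the placeholder byte" >&2
    return 1
  fi
  printf '%s' "$1" | sed -z 's/-->/\x14/g;s/<!--[^\x14]*\x14//g;s/\x14/-->/g'
}

strip_comments '<!-- cmt1 --> text1 <!-- cmt2 --> text2'
```

The same check works for any other single-byte placeholder you might pick instead.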
Renaud Pacalet
  • Thanks, this is currently the best solution. It should work in all cases where the content to filter is pure text. I think the replacement character should be chosen carefully when non-Latin UTF-8 characters are used extensively. – dodrg Mar 14 '23 at 12:27
  • But I still hope to get a version that is capable of negating the subexpression itself. – dodrg Mar 14 '23 at 12:27
  • @potong: if you undelete your answer and update it I'll delete my own answer. – Renaud Pacalet Mar 14 '23 at 14:13
  • @potong: I appreciate your version of the solution. A pity that you deleted it. – dodrg Mar 14 '23 at 14:50
  • Your variation of potong's solution eliminates the issues and returns proper results for me. – dodrg Mar 14 '23 at 14:59
  • Good to know. I learned something here: the `sed` regex that corresponds to anything except a given `ABC` string (and that does not start with `C` or `BC`) can be constructed as `([^C]|[^B]C|[^A]BC)*`. And of course this extends to any length. – Renaud Pacalet Mar 14 '23 at 15:03
  • Yes, it's a nice construct. But being able to negate a subexpression would have been nice. It's a pity that sed is not capable of doing so; things would be much cleaner to implement. That's my conclusion. – dodrg Mar 14 '23 at 15:54
  • `sed` is small and fast. You cannot support arbitrary complex regex and be small and fast. If you need the full power of `perl` regex you'll have to use something else. – Renaud Pacalet Mar 14 '23 at 16:01
  • I've extended your variation of potong's solution: `sed -e :a -re 's///g;/ – dodrg Mar 14 '23 at 16:01
  • Nice but this does not make any difference if your 10GB file is just one big comment. So the worst case memory usage will be the same. – Renaud Pacalet Mar 14 '23 at 16:09
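
The ([^C]|[^B]C|[^A]BC)* construct from the comments can be tried on the sample line from the question, with A and B being - and C being >. This is a sketch under the assumption stated in that comment (the comment body does not start with > or ->); the variable names are made up.

```shell
sample='<!-- comment #1 --><p>useful text</p>  <!-- comment #2 -->more useful text'

# The group can consume '>' only when it is not the end of a "-->" sequence,
# so each match stops at the first closing delimiter.
result=$(printf '%s\n' "$sample" | sed -E 's/<!--([^>]|[^-]>|[^-]->)*-->//g')

printf '%s\n' "$result"   # prints: <p>useful text</p>  more useful text
```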
0

Using the proper tool:

Don't use sed or regex to parse XML. You cannot, and must not, parse any structured text like XML/HTML with tools designed to process raw text lines. If you need to process XML/HTML, use an XML/HTML parser. The great majority of languages have built-in support for parsing XML, and there are dedicated tools like XMLStarlet if you need a quick shot from a command-line shell. Never accept a job if you don't have access to proper tools.

file:

<root>
<!-- comment #1 -->
<p>useful text</p>
<!-- comment #2 -->
more useful text
</root>

code

$ xmlstarlet ed -d  '//comment()' file.xml
<?xml version="1.0"?>
<root><p>useful text</p>
more useful text
</root>
Gilles Quénot
  • Sure, you are right that text-processing tools are not adequate for structured content. But using this HTML example was the easiest way to illustrate my problem. (1) **sed** is mandatory for the solution. (2) The "unorganized" use of line breaks in the input is part of the challenge; otherwise a simple `grep -v " – dodrg Mar 14 '23 at 00:57
-1

Using sed

$ sed 's/<!--[^>]*-->//g' input_file
<p>useful text</p>  more useful text
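
A quick sketch of where this expression stops working: a comment whose body itself contains a >. The input below is made up for illustration.

```shell
# Hypothetical input: a commented-out list item followed by regular content.
with_tag='<!-- <li>Content item</li> --><p>useful text</p>'

# [^>]* cannot cross the '>' of <li>, and no "-->" follows at that point,
# so nothing matches and the comment is left in place.
kept=$(printf '%s\n' "$with_tag" | sed 's/<!--[^>]*-->//g')

printf '%s\n' "$kept"   # the line comes back unchanged
```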
HatLess
  • Please reserve `sed` for plain text. Don't use regex to remove HTML/XML nodes. [Don't use `sed` nor `regex` to parse `XML`](https://stackoverflow.com/a/49352373/465183) you cannot, must not parse any structured text like XML/HTML with tools designed to process raw text lines. If you need to process XML/HTML, use an XML/HTML parser. A great majority of languages have built-in support for parsing XML and there are dedicated tools like XMLStarlet if you need a quick shot from a command line shell. Never accept a job if you don't have access to proper tools. – Gilles Quénot Mar 13 '23 at 18:38
  • @Gilles Quénot: Nice to see your evangelism. But not everyone is able to live in an Ivory Tower. Often the reality is almost the opposite of it. My question is a real life issue. – dodrg Mar 14 '23 at 07:48
  • @HatLess: Thank you for your sample. This one would do it, as long as the comment does not contain a `>`. But that's a quite common situation (a one-liner): ```
  • Content item
  • ```. Reasonably, such comments should be eliminated before production use. – dodrg Mar 14 '23 at 07:51
  • @dodrg `sed 's/\|//g' file` ? – HatLess Mar 14 '23 at 08:23
  • @HatLess: Your extended version will additionally remove the comment-tag-marks, but leave `
  • Alternate content
  • ` uncommented, making it visible when rendering the page. – dodrg Mar 14 '23 at 10:11