I am looking for an efficient way to strip HTML comments from a string representation of HTML:

<div>
  <!-- remove this -->
  <ul>
    <!-- and this -->
    <li></li>
    <li></li>
  </ul>
</div>

I do not want to convert the string to actual nodes; the content is originally a string, and the file size is around 600 MB.

Curious whether anyone has had this problem before and found an efficient, easily generalized solution.

TaylorMac
  • Regular expressions acceptable? – David Bradbury Jul 22 '13 at 16:52
  • absolutely, preferred – TaylorMac Jul 22 '13 at 16:54
  • If you don't have nasty markup like `
    ` (notice, though, this **is** valid HTML) then regexes are acceptable, as comments are not nestable (`` is the "same" as ` -->`). If you do, or if you are afraid you might, among other reasons, then consider a broader tool, like a parser.
    – acdcjunior Jul 22 '13 at 16:55
  • @acdcjunior ` -->` is **not** the same as ``. The first parses to `{ignored} -->`, the second parses as `{ignored}` where `{ignored}` is the part the HTML parser ignores. This is precisely because comments are **not** nestable. – dtech Jul 22 '13 at 17:00
  • @dtech You are right. I expressed myself wrongly. I meant the initial part was the same. Meaning the second `-->` would not be a part of the comment, just as you said, so the regexes wouldn't have to mind nesting. (Edit: It does not matter now, I just tested, `` is **not valid HTML** as I predicted. It yields the error ***The document is not mappable to XML 1.0 due to two consecutive hyphens in a comment.*** in the second `--`, meaning two hyphens inside a comment are only allowed to close it, nothing else.) – acdcjunior Jul 22 '13 at 17:04
  • I don't think you're going to be able to reasonably treat a 600mb string without lag & memory problems. – Robert Hoffmann Jul 22 '13 at 17:04
  • Would you suggest a technique for achieving this on a 600mb file? – TaylorMac Jul 22 '13 at 17:07
  • At which file size would you say that lag and memory problems would occur when modifying a string in this way? – TaylorMac Jul 22 '13 at 17:08
  • I ask because I can separate the file into smaller files no problem. I'm just looking for an efficient solution – TaylorMac Jul 22 '13 at 17:09
  • please see http://stackoverflow.com/a/5654032/2100709 html = html.replace(//g, "") – mwein Jul 22 '13 at 17:09
  • @TaylorMac To use regexes, you'll have to make several assumptions about the file (it is valid HTML + there are no `
    ` + the others suggested in the link by @mwein). Even with a parser, you'd have to know how it deals with invalid HTML. Why JavaScript, tho?
    – acdcjunior Jul 22 '13 at 17:18
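
On the file-size concern raised in the comments above: one way to avoid holding all 600 MB in a single string (which may exceed the engine's maximum string length) is to strip comments chunk by chunk. The following is only a minimal sketch under the same assumptions as the regex approach (comments are delimited by `<!--` and `-->` and do not nest). The names `createCommentStripper`, `push`, and `flush` are hypothetical, not part of any library. It carries a small tail between chunks so that a delimiter split across a chunk boundary is still recognized.

```javascript
// Stateful HTML comment stripper that can be fed text in arbitrary chunks.
// Assumption: an unterminated comment at end of input is dropped, whereas a
// single regex replace over the whole string would leave it in place.
function createCommentStripper() {
  let buffer = "";       // unprocessed text carried over between chunks
  let inComment = false; // true while between "<!--" and "-->"

  // Length of the longest buffer suffix that could be the start of "<!--".
  function partialOpenLength(text) {
    if (text.endsWith("<!-")) return 3;
    if (text.endsWith("<!")) return 2;
    if (text.endsWith("<")) return 1;
    return 0;
  }

  // Accepts the next chunk and returns the comment-free text that is safe to emit.
  function push(chunk) {
    buffer += chunk;
    let out = "";
    for (;;) {
      if (inComment) {
        const end = buffer.indexOf("-->");
        if (end === -1) {
          // Keep the last two characters in case "-->" straddles the boundary.
          buffer = buffer.slice(-2);
          return out;
        }
        buffer = buffer.slice(end + 3);
        inComment = false;
      } else {
        const start = buffer.indexOf("<!--");
        if (start === -1) {
          // Hold back a possible partial "<!--" at the very end of the buffer.
          const keep = partialOpenLength(buffer);
          out += buffer.slice(0, buffer.length - keep);
          buffer = buffer.slice(buffer.length - keep);
          return out;
        }
        out += buffer.slice(0, start);
        buffer = buffer.slice(start + 4);
        inComment = true;
      }
    }
  }

  // Call once after the last chunk to emit any held-back text.
  function flush() {
    const rest = inComment ? "" : buffer;
    buffer = "";
    return rest;
  }

  return { push, flush };
}
```

In Node.js, for example, the chunks could come from `fs.createReadStream(path, { encoding: "utf8" })`, writing each `push` result to an output stream and calling `flush` when the input ends. Memory use then depends on the chunk size rather than on the file size.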

1 Answer

Assuming the variable s represents your HTML string, a RegExp replace as follows should work just fine.

s = s.replace(/<!--[\s\S]*?-->/g, ""); // *? rather than +? so the empty comment <!----> is matched too

Variable s should now have comments removed.
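
As a quick check, here is a small sketch that applies the same kind of replace to the sample markup from the question. The helper name `stripComments` is hypothetical and only used for illustration. The quantifier here is `*?` so that an empty comment (`<!---->`) is matched as well, and `[\s\S]` is what lets a comment span several lines.

```javascript
// Sample markup from the question, built line by line for readability.
const html = [
  "<div>",
  "  <!-- remove this -->",
  "  <ul>",
  "    <!-- and this -->",
  "    <li></li>",
  "    <li></li>",
  "  </ul>",
  "</div>",
].join("\n");

// Removes every <!-- ... --> span; the lazy quantifier stops at the first "-->".
function stripComments(s) {
  return s.replace(/<!--[\s\S]*?-->/g, "");
}

const stripped = stripComments(html);
console.log(stripped);
```

Note that only the comments themselves are removed: the indentation and line breaks around them stay, so lines that held nothing but a comment become whitespace-only lines.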

Fabrício Matté
Jim Elrod
  • Will test, although I am sure this is likely the best solution – TaylorMac Jul 22 '13 at 17:24
  • Be aware of the gotcha that `--` inside HTML comments is "forbidden" according to MDN: https://developer.mozilla.org/en-US/docs/Web/API/Comment#Specification but is allowed by the spec, though the spec notes that some tools may not allow it. This trips me up when I occasionally try to comment out `script` tags – HBP Jul 22 '13 at 17:56
  • The ban on "--" is for XML documents, not HTML documents. Read your link again. – dandavis Jul 22 '13 at 20:46