
I am trying to remove unnecessary content from HTML; specifically, I want to remove comments. I found a pretty good solution (Grabbing meta-tags and comments using HTML Agility Pack). However, the DOCTYPE is treated as a comment node and is therefore removed along with the real comments. How can I improve the code below so that the DOCTYPE is preserved?

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(htmlContent);
var nodes = htmlDoc.DocumentNode.SelectNodes("//comment()");
if (nodes != null)
{
    foreach (HtmlNode comment in nodes)
    {
        comment.ParentNode.RemoveChild(comment);
    }
}
desautelsj

2 Answers

doc.DocumentNode.Descendants()
 .Where(n => n.NodeType == HtmlAgilityPack.HtmlNodeType.Comment)
 .ToList()
 .ForEach(n => n.Remove());

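This strips every comment node from the document. Because HTML Agility Pack treats the DOCTYPE as a comment node, as noted in the question, the DOCTYPE is removed too unless it is filtered out first.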
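
To keep the DOCTYPE with this LINQ approach, one option is to add a second filter before removal. The following is a minimal sketch, not a definitive implementation: the class and method names (`CommentStripper`, `StripComments`) are hypothetical, and it assumes that the text of the DOCTYPE node begins with `<!DOCTYPE`, which matches what the asker reports in the comments further down.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class CommentStripper
{
    // Hypothetical helper: returns the HTML with comment nodes removed
    // and the DOCTYPE declaration left in place.
    public static string StripComments(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        doc.DocumentNode.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Comment)
            // Assumption: the DOCTYPE node's text starts with "<!DOCTYPE".
            .Where(n => !n.InnerText.TrimStart()
                .StartsWith("<!DOCTYPE", StringComparison.OrdinalIgnoreCase))
            .ToList() // materialize the matches before mutating the tree
            .ForEach(n => n.Remove());

        return doc.DocumentNode.OuterHtml;
    }
}
```

The `ToList()` call matters here: it takes a snapshot of the matching nodes so that removing them does not disturb the lazy `Descendants()` enumeration.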

Jim Counts
BlueBird

Check that the comment does not start with DOCTYPE before removing it:

  foreach (var comment in nodes)
  {
     if (!comment.InnerText.StartsWith("DOCTYPE"))
         comment.ParentNode.RemoveChild(comment);
  }
Richard Schneider
  • Is that safe? What if there is a comment like ? I know it's an edge case but I guess my point is: isn't there a better way than to check the content of the comment node? – desautelsj Jul 04 '11 at 05:52
  • Maybe ignore it when it starts with DOCTYPE and is the FIRST child of the root element? – Richard Schneider Jul 04 '11 at 06:00
  • I did some testing and figured out the content of the comment actually includes '<!'. This means I can improve your suggestion a little bit: `code`if (!comment.InnerText.StartsWith(" – desautelsj Jul 04 '11 at 06:10
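
The edge case raised in these comments (a genuine comment whose text happens to begin with the word DOCTYPE) can be sidestepped by testing for the comment opener rather than for the DOCTYPE keyword. The sketch below is one possible approach under stated assumptions: it relies on the observation above that the node text includes the leading `<!`, so a real comment would start with `<!--` while the DOCTYPE would not. The names `DoctypeSafeCleaner` and `RemoveRealComments` are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

public static class DoctypeSafeCleaner
{
    // Hypothetical helper built on the loop from the question.
    public static string RemoveRealComments(string htmlContent)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(htmlContent);

        // SelectNodes returns null when nothing matches, hence the null check.
        var nodes = htmlDoc.DocumentNode.SelectNodes("//comment()");
        if (nodes != null)
        {
            foreach (HtmlNode comment in nodes)
            {
                // Assumption: real comments carry their "<!--" opener in the
                // node text, while the DOCTYPE node starts with "<!DOCTYPE".
                if (comment.InnerText.StartsWith("<!--", StringComparison.Ordinal))
                {
                    comment.ParentNode.RemoveChild(comment);
                }
            }
        }

        return htmlDoc.DocumentNode.OuterHtml;
    }
}
```

With this check, a comment such as `<!-- DOCTYPE notes -->` would still be removed, because the decision depends on the opening delimiter and not on the words inside the comment.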