Here is the very simple code I have:

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionWriteEmptyNodes = true;
htmlDoc.Load("sourcefilepath");
htmlDoc.Save("destfilepath", Encoding.UTF8);

Input:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
    <link rel="stylesheet" href="main.css" type="text/css"/>
  </head>
  <body>lots of text here, obviously not relevant to this problem</body>
</html>

Output:

<?xml version="1.0" encoding="UTF-8" />
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8" />
    <link rel="stylesheet" href="main.css" type="text/css" />
  </head>
  <body>lots of text here, obviously not relevant to this problem</body>
</html>

You can see that the first line is wrong: the XML declaration ends with /> instead of ?>. This happens when I set OptionWriteEmptyNodes to true. I set it to true because otherwise the meta and link tags (and some others in the document body) are not closed.

Does anyone know how to solve this?

Alex

3 Answers

Seems like a bug. You should report it to http://htmlagilitypack.codeplex.com.

Still, you can work around that bug like this:

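// ElementsFlags is static, so these removals affect every HtmlDocument in the process.
HtmlNode.ElementsFlags.Remove("meta");
HtmlNode.ElementsFlags.Remove("link");
HtmlDocument htmlDoc = new HtmlDocument();
// OptionWriteEmptyNodes is deliberately left at its default here.
htmlDoc.Load("sourcefilepath");
htmlDoc.Save("destfilepath", Encoding.UTF8);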

Removing the flags for the meta and link tags stops the Html Agility Pack from leaving them unclosed, so OptionWriteEmptyNodes does not need to be set to true.

It produces the following (note that it is slightly different: meta and link get explicit closing tags instead of being self-closed):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"></meta>
    <link rel="stylesheet" href="main.css" type="text/css"></link>
  </head>
  <body>lots of text here, obviously not relevant to this problem</body>
</html>
Simon Mourier
  • Thanks, this looks good as a workaround. Meanwhile, I found this problem reported on the CodePlex forums too, without a resolution yet, but I believe it will be fixed soon. – Alex Jun 15 '12 at 11:06

I found another way to work around this problem, and it works slightly better in my case than the one above. It replaces the first child of the DocumentNode, which is the XML declaration. (Note that the input must contain the XML declaration; in my case it always does.)

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionWriteEmptyNodes = true;
htmlDoc.Load("sourcepath");

// Build a replacement declaration that ends with ?> as it should.
var newNodeStr = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
var newNode = HtmlNode.CreateNode(newNodeStr);

// Swap it in for the first child, assumed here to be the XML declaration.
htmlDoc.DocumentNode.ReplaceChild(newNode, htmlDoc.DocumentNode.FirstChild);

htmlDoc.Save("destpath", Encoding.UTF8);

Please note that Simon's workaround works too, so use whichever fits your scenario better.

Alex

My pages also contain <br/> tags, and removing htmlDoc.OptionWriteEmptyNodes = true; breaks them by replacing them with <br>. I found an approach similar to Alex's answer, but a bit more generic: it keeps most of the original values and doesn't rely on there always being an XML declaration in your page:

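HtmlDocument doc = new HtmlDocument();
doc.OptionWriteEmptyNodes = true;
doc.Load("pathToFile");
// Only patch when the first node is an XML declaration (also skips an empty document).
if (doc.DocumentNode.FirstChild?.OriginalName == "?xml")
{
    // The declaration is serialized with a trailing "/>"; turn that '/' back into '?'.
    // Note: this replaces every '/' in the declaration, not only the last one.
    var fixedOuterHtml = doc.DocumentNode.FirstChild.OuterHtml.Replace('/', '?');
    var newNode = HtmlNode.CreateNode(fixedOuterHtml);
    doc.DocumentNode.ReplaceChild(newNode, doc.DocumentNode.FirstChild);
}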
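
If replacing the node is not convenient, the same repair can also be done on the serialized text instead of on the node tree. The following is only a sketch of that different, text-based approach: it uses nothing but System.Text.RegularExpressions, the names XmlDeclarationFix and Repair are made up for illustration and are not part of the Html Agility Pack, and it assumes the declaration is the very first thing in the text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the Html Agility Pack): repairs an XML
// declaration that was written out with "/>" instead of "?>".
static class XmlDeclarationFix
{
    // Matches a declaration at the very start of the text that ends in "/>"
    // and captures everything before that terminator (minus trailing spaces).
    private static readonly Regex BrokenDeclaration =
        new Regex(@"\A(<\?xml\b[^>]*?)\s*/>", RegexOptions.CultureInvariant);

    public static string Repair(string html)
    {
        // Replace at most one occurrence; text without a broken declaration
        // is returned unchanged.
        return BrokenDeclaration.Replace(html, "$1?>", 1);
    }
}
```

One way to use it would be to read the saved file back with File.ReadAllText, pass the text through Repair, and write it out again with File.WriteAllText and Encoding.UTF8. Only the first line is touched, so the self-closed <br/>, meta and link tags stay as the Html Agility Pack wrote them.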
Norrec