I'm reading an HTML document and want to extract only the contents of its <body> element.
The following markup is generated by a DevExpress RichEditControl:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>
</title>
<style type="text/css">
.cs95E872D0{text-align:left;text-indent:0pt;margin:0pt 0pt 0pt 0pt}
.csCF6BBF71{color:#000000;background-color:transparent;font-family:Times New Roman;font-size:12pt;font-weight:normal;font-style:normal;}
</style>
</head>
<body>
<p class="cs95E872D0"><span class="csCF6BBF71">Content goes here</span></p></body>
</html>
Following the example from this answer on how to read the document, I wrote the following function:
private string ParseHtml(string html)
{
    XDocument doc = XDocument.Parse(html);
    return doc.Elements("html").Single().Element("body").Value;
}
It seems like this should work in theory, but in practice the LINQ query returns no results for .Elements("html").
Am I way off the mark here? How can I read the HTML document and extract what I need?