I'm reading HTML with the purpose of extracting only the contents of <body> from it.

The following markup is generated by a DevExpress RichEditControl:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><title>
        </title>
        <style type="text/css">
            .cs95E872D0{text-align:left;text-indent:0pt;margin:0pt 0pt 0pt 0pt}
            .csCF6BBF71{color:#000000;background-color:transparent;font-family:Times New Roman;font-size:12pt;font-weight:normal;font-style:normal;}
        </style>
    </head>
    <body>
        <p class="cs95E872D0"><span class="csCF6BBF71">Content goes here</span></p></body>
</html>

Following the example from this answer on how to read the document, I wrote the following function:

private string ParseHtml(string html)
{
    XDocument doc = XDocument.Parse(html);
    return doc.Elements("html").Single().Element("body").Value;
}

It seems like it should work in theory, but in practice the LINQ query returns no results for .Elements("html").

Am I way off the mark here? How can I read the html document and extract what I need?

Ortund

1 Answer

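It's probably because you need to include the namespace. The root element declares xmlns="http://www.w3.org/1999/xhtml", so every element name in the document belongs to that namespace and a plain "html" does not match: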

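private string ParseHtml(string html)
{
    // Default namespace declared on the <html> root element.
    XNamespace xmlns = "http://www.w3.org/1999/xhtml";

    XDocument doc = XDocument.Parse(html);

    // Qualify each element name with the namespace so the lookup matches.
    return doc.Element(xmlns + "html").Element(xmlns + "body").Value;
}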

Or:

return doc.Descendants(xmlns+"body").Single().Value;

Another good way to parse HTML is to use the HTML Agility Pack.

ocuenca
  • Just to add: `XNamespace` has a `GetName` method for this, and `XName` has a static `Get` method as well. Rather than hard-coding the namespace, you can call doc.Root.GetDefaultNamespace(), which returns "http://www.w3.org/1999/xhtml" here and also works when the element has no namespace. – Filip Cordas Jul 19 '17 at 15:43
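
The suggestion in the comment above could be sketched roughly as follows. This is a minimal illustration, not code from the thread: the class name `BodyExtractor` and the method name `ExtractBodyText` are hypothetical, and it assumes the input is well-formed XHTML that `XDocument.Parse` accepts.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper class; the name is an assumption for illustration only.
public static class BodyExtractor
{
    // Returns the concatenated text of <body>, or null if there is no <body>.
    public static string ExtractBodyText(string html)
    {
        XDocument doc = XDocument.Parse(html);

        // Read the default namespace from the root element instead of hard-coding it.
        // When the root declares no default namespace this is XNamespace.None,
        // so unqualified element names still match.
        XNamespace ns = doc.Root.GetDefaultNamespace();

        // ns.GetName("body") builds the same XName as ns + "body".
        XElement body = doc.Root.Element(ns.GetName("body"));

        return body?.Value;
    }
}
```

Note that `Value` yields only the text inside `<body>`. If the markup itself is needed, concatenating the string form of `body.Nodes()` would be one option, though the child elements would then likely carry the XHTML namespace declaration.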