Suppose I have an XML document that looks something like this (it basically represents an HTML report):

<html>
 <head>...</head>
 <body>
   <div>
   <table>
     <tr>
       <td>Stuff</td>
     </tr>
     <tr>
       <td>More stuff<br /><br />More stuff on another line and some whitespace...  </td>
     </tr>
     <tr>
       <td>  Some leading whitespace before this stuff<br />Stuff</td>
     </tr>
   </table>
   </div>
 </body>
</html>

Using C#, I want to convert this document into a plain text string that looks something like:

Stuff
More stuff

More stuff on another line and some whitespace...
  Some leading whitespace before this stuff
Stuff

It should be smart enough to convert table rows into new lines and to insert a new line wherever an inline br tag appears within a cell. It should also keep any whitespace in the table cells intact.

I tried loading the document with the XmlDocument class and reading the InnerText property of the body node, but that doesn't produce the output I am looking for (newlines and whitespace are not kept intact). Is there a simple way to do this? I know one option would be to extract the HTML as a single string and run several regular expressions over it to handle the newlines and whitespace. Thanks!

Andrew
  • Won't this help? https://stackoverflow.com/questions/731649/how-can-i-convert-html-to-text-in-c – hardkoded Jun 10 '17 at 14:45
  • ML is Markup Language (tagged data). XML and HTML are both markup languages, but they differ, so you can't go from XML to HTML. Occasionally XML is embedded in an HTML document, but in your case you have just HTML with no XML. – jdweng Jun 10 '17 at 15:06

1 Answer

Try this please:

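using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

var doc = XElement.Load("test.xml");

var sb = new StringBuilder();

// Append the value of every text node, one per line.
foreach (var text in doc.DescendantNodes().Where(node => node.NodeType == XmlNodeType.Text))
{
    sb.AppendLine(((XText)text).Value);
}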

More concise:

// OfType<XText>() filters and casts in one step; Value returns the unescaped text.
foreach (var text in doc.DescendantNodes().OfType<XText>())
{
    sb.AppendLine(text.Value);
}
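
Note that these loops only visit text nodes, so two consecutive br tags do not produce the blank line shown in the question. If that matters, a recursive walk that treats br and the end of each tr as a line break may be closer to what is wanted. The following is only a minimal sketch: the names HtmlReportText and ToPlainText are made up for illustration, it assumes the input is well-formed XML, and it emits "\n" rather than Environment.NewLine so the result is the same on every platform.

```csharp
using System.Text;
using System.Xml.Linq;

// Hypothetical helper (not part of any library): flattens an XHTML-like
// report into plain text.
public static class HtmlReportText
{
    public static string ToPlainText(XElement root)
    {
        var sb = new StringBuilder();
        Append(root, sb);
        return sb.ToString();
    }

    private static void Append(XElement element, StringBuilder sb)
    {
        foreach (var node in element.Nodes())
        {
            if (node is XText text)
            {
                // Keep the cell text exactly as parsed, including
                // leading and trailing spaces.
                sb.Append(text.Value);
            }
            else if (node is XElement child)
            {
                if (child.Name.LocalName == "br")
                {
                    // An inline <br /> becomes a line break.
                    sb.Append('\n');
                }
                else
                {
                    Append(child, sb);

                    // Finish each table row with a line break.
                    if (child.Name.LocalName == "tr")
                    {
                        sb.Append('\n');
                    }
                }
            }
        }
    }
}
```

It could be called as HtmlReportText.ToPlainText(doc.Element("body")) so that the contents of head are skipped; Element("body") returns null if the document declares an XML namespace, in which case the lookup needs the qualified name. With the default load options, whitespace-only text between tags (the indentation) is dropped by the parser, while whitespace inside a text node that also contains other characters is kept.
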
Alexander Petrov