I need to extract all the text inside the <body> of an HTML document. Sample HTML input:

<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src="abc.jpg"/>
    </body>
</html>

The output should be:

This is a big title. How are doing you? I am fine

I want to use only the HTML Agility Pack for this. No regular expressions, please.

I know how to load an HtmlDocument and select the body with an XPath expression such as '//body'. But how do I strip the HTML tags so that I get the output shown above?

Thanks in advance :)

TCM
    See [this question](http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack) for some HTML Agility Pack links. I would guess you have to call something like `InnerText` property on the `HtmlNode`. – Uwe Keim May 01 '11 at 09:49

4 Answers

You can use the body's InnerText:

string html = @"
<html>
    <title>title</title>
    <body>
           <h1> This is a big title.</h1>
           How are doing you?
           <h3> I am fine </h3>
           <img src=""abc.jpg""/>
    </body>
</html>";

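// Requires the HtmlAgilityPack namespace (using HtmlAgilityPack;).
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// InnerText concatenates all text under <body>, with the tags removed.
string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;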

Next, you may want to collapse spaces and new lines:

text = Regex.Replace(text, @"\s+", " ").Trim();
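
Since the question rules out regular expressions, the same collapsing can also be done with plain string methods. This is a minimal sketch, not part of the original answer; the names `Whitespace` and `Collapse` are hypothetical helpers and do not come from any library.

```csharp
using System;

static class Whitespace
{
    // Collapses every run of whitespace (spaces, tabs, newlines) into a
    // single space and trims both ends, without using Regex.
    public static string Collapse(string text)
    {
        // A null separator array makes Split treat any whitespace
        // character as a delimiter; RemoveEmptyEntries drops the empty
        // strings produced by consecutive delimiters.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

With this helper, the last step would read `text = Whitespace.Collapse(text);`.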

Note, however, that although this works here, markup such as hello<br>world or hello<i>world</i> is converted by InnerText to helloworld: the tags are removed and no whitespace is inserted in their place. That is difficult to solve in general, because how text is displayed is often determined by the CSS, not just by the markup.
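
One partial workaround is to walk the tree yourself and emit a space wherever a tag that usually breaks the line appears. The sketch below is an illustration under stated assumptions, not a complete solution: the class name `SpacedText` and the hand-picked `Breaking` tag list are hypothetical, and the code ignores CSS entirely, so an element restyled with `display: inline` or `display: block` will still be handled by its tag name alone.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

static class SpacedText
{
    // Assumption: a small, hand-picked set of tags treated as word boundaries.
    static readonly HashSet<string> Breaking =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "br", "p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
            "li", "ul", "ol", "tr", "td", "th", "table"
        };

    public static string Extract(HtmlNode root)
    {
        var sb = new StringBuilder();
        Walk(root, sb);
        // Collapse whitespace runs without using Regex.
        string[] words = sb.ToString()
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }

    static void Walk(HtmlNode node, StringBuilder sb)
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            // Decode entities such as &amp; before appending.
            sb.Append(HtmlEntity.DeEntitize(node.InnerText));
            return;
        }
        if (node.NodeType == HtmlNodeType.Comment)
        {
            return; // skip <!-- ... -->
        }

        bool breaks = Breaking.Contains(node.Name);
        if (breaks) sb.Append(' ');
        foreach (HtmlNode child in node.ChildNodes)
        {
            Walk(child, sb);
        }
        if (breaks) sb.Append(' ');
    }
}
```

Called as `SpacedText.Extract(doc.DocumentNode.SelectSingleNode("//body"))`, this separates the two words in hello<br>world while still joining hello<i>world</i>, because `i` is not in the list.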

Kobi

How about using the XPath expression '//body//text()' to select all text nodes?
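
A minimal sketch of that idea with the HTML Agility Pack, using no regular expressions. The class and method names (`BodyText`, `Extract`) are hypothetical. Be aware that `//body//text()` also matches text inside any `<script>` or `<style>` elements within the body, which this sketch does not filter out.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class BodyText
{
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//body//text()");
        if (nodes == null)
        {
            return string.Empty;
        }

        var words = new List<string>();
        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp;, then split on any whitespace
            // so that indentation and newlines in the markup are dropped.
            string decoded = HtmlEntity.DeEntitize(node.InnerText);
            words.AddRange(decoded.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));
        }

        // Every text node contributes its words, separated by single spaces.
        return string.Join(" ", words);
    }
}
```

Because each text node is joined with a space, adjacent inline elements such as hello<i>world</i> come out as two words, which may or may not be what you want.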

Oleks
chiborg

You can use NUglify, which supports text extraction from HTML:

var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   // prints: This is a text

Because it uses a custom HTML5 parser, it should be quite robust (especially if the document doesn't contain any errors), and it is very fast: no regular expressions are involved, just a pure recursive descent parser, which is faster than HtmlAgilityPack and more GC friendly.

xoofx

Normally I would recommend an HTML parser for parsing HTML; however, since you only want to remove all the HTML tags, a simple regex should work.
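
The answer does not give a pattern. One possibility, as a rough sketch rather than a robust solution, is shown below; the names `TagStripper` and `Strip` are hypothetical. It assumes the input is already just the body fragment, it does not decode entities such as `&amp;`, and it will misbehave on comments, `<script>` content, or a `>` inside an attribute value.

```csharp
using System.Text.RegularExpressions;

static class TagStripper
{
    public static string Strip(string fragment)
    {
        // Replace anything that looks like a tag with a space so that
        // words on either side of the tag do not run together.
        string noTags = Regex.Replace(fragment, "<[^>]+>", " ");

        // Collapse the resulting whitespace runs and trim the ends.
        return Regex.Replace(noTags, @"\s+", " ").Trim();
    }
}
```

Note that this goes against the question's "no regular expressions" requirement, so it only applies if that constraint can be relaxed.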

TheLukeMcCarthy