I need a bit of help with HTML Agility Pack!

Basically, I want to grab the plain text within the body node of the HTML. So far I have tried this in VB.NET, but it fails to return the inner text; as far as I can see, nothing changes.

Dim htmldoc As HtmlDocument = New HtmlDocument
htmldoc.LoadHtml(html)

Dim paragraph As HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("//body")

If Not htmldoc Is Nothing Then
   For Each node In paragraph
       node.ParentNode.RemoveChild(node, True)
   Next
End If

Return htmldoc.DocumentNode.WriteContentTo

I have also tried this:

Return htmldoc.DocumentNode.InnerText

But still no luck!

Any advice?

KJSR
  • possible duplicate of [Grab all text from html with Html Agility Pack](http://stackoverflow.com/questions/4182594/grab-all-text-from-html-with-html-agility-pack) – richard Aug 23 '15 at 10:02

2 Answers

How about:

Return htmldoc.DocumentNode.SelectSingleNode("//body").InnerText
Jeff Mercado
  • Hi Jeff, I tried that earlier as well, but the returned data contains a lot of unwanted characters like &, {, } and lots of HTML tagging and also script tags. The line spacing is just out of the window :) Perhaps I could use regex to fix that, but I want to concentrate more on Html Agility Pack. – KJSR Jul 27 '11 at 23:33
  • Well, there's not much you can do about that except cleaning the undesirables out of the HTML. InnerText includes everything in the document that is not part of an element's tag markup. So that includes whitespace, code in script nodes, etc. If your goal is to get the text as it looks when rendered in a web browser, you're not going to get it this way. – Jeff Mercado Jul 27 '11 at 23:43
  • Hmm, I understand what you mean. Perhaps I have confused inner text with plain text. The main aim is to get back clean text, or parsed HTML with the main text content in it. Could you show me how to go about this, please? – KJSR Jul 27 '11 at 23:56
  • I don't really know, to be honest. What I would do is load it up in a browser and get the text from the screen (not the source). Doing that programmatically is a different thing altogether. – Jeff Mercado Jul 28 '11 at 00:01
  • Actually, what I meant was an actual browser (IE, FF, Chrome, etc.). Though I suppose the `WebBrowser` control could help you with this, I just wouldn't know how. – Jeff Mercado Jul 28 '11 at 00:09
  • Updated answer to answer questions above – MGot90 Jul 11 '16 at 20:26
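
The cleanup discussed in the comments above (dropping script content and decoding entities such as &amp;) could be approached roughly as follows. This is only a sketch under stated assumptions, not the answerer's method: the module name `BodyTextSketch` and the function `GetBodyText` are hypothetical, and it assumes that removing `script` and `style` nodes and joining the remaining text nodes with a single space is acceptable for your pages. It will not reproduce the layout a browser renders.

```vbnet
Imports System.Linq
Imports HtmlAgilityPack

Module BodyTextSketch
    ' Hypothetical helper: returns the text found inside <body>,
    ' without script/style content and with entities decoded.
    Function GetBodyText(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim unwanted = doc.DocumentNode.SelectNodes("//script|//style")
        If unwanted IsNot Nothing Then
            ' Copy to a list first so nodes can be removed while iterating.
            For Each node In unwanted.ToList()
                node.Remove()
            Next
        End If

        Dim body = doc.DocumentNode.SelectSingleNode("//body")
        If body Is Nothing Then Return String.Empty

        Dim textNodes = body.SelectNodes(".//text()")
        If textNodes Is Nothing Then Return String.Empty

        ' Decode entities (&amp; becomes &), trim each piece, skip whitespace-only pieces.
        Dim parts = textNodes.
            Select(Function(n) HtmlEntity.DeEntitize(n.InnerText).Trim()).
            Where(Function(s) s.Length > 0)

        ' Assumption: one space between text nodes is good enough as a separator.
        Return String.Join(" ", parts)
    End Function
End Module
```

Calling `GetBodyText(html)` in place of `htmldoc.DocumentNode.InnerText` would return one space-separated string; if line breaks matter, the separator passed to `String.Join` is the place to change.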

Jeff's solution is OK if you have no tables; otherwise the text of adjacent table cells runs together, like cell1cell2cell3. To prevent this, use this code (C# example):

var words = doc.DocumentNode?.SelectNodes("//body//text()")?.Select(x => x.InnerText);
return words != null ? string.Join(" ", words) : String.Empty;
EminST