
I am using HtmlAgilityPack. Is there a one-liner that returns all of the inner text of an HTML document, i.e., with all HTML tags and scripts removed?

Peter Boughton
Yang

2 Answers


Like this:

doc.DocumentNode.InnerText

Note that this also returns the text content of <script> and <style> tags.

To fix that, remove all of the <script> and <style> tags before reading InnerText, like this:

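// Requires "using System.Linq;" for ToArray(), which snapshots the matches
// so that nodes can be removed while looping.
foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
    script.Remove();
foreach (var style in doc.DocumentNode.Descendants("style").ToArray())
    style.Remove();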
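
Putting the two steps together, a minimal sketch of a reusable helper might look like the following. The class and method names (`HtmlText`, `Extract`) are hypothetical and not part of HtmlAgilityPack; the sketch also assumes you want HTML entities decoded, which `InnerText` may not do on its own, so it passes the result through `HtmlEntity.DeEntitize`.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlText
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the text of
    // an HTML string with <script> and <style> content dropped.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToArray() snapshots the matches so that removing nodes does not
        // disturb the enumeration.
        var unwanted = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToArray();
        foreach (var node in unwanted)
            node.Remove();

        // Assumption: the caller wants entities such as &amp; decoded.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

With this in place, the call site becomes the one-liner the question asks for, e.g. `string text = HtmlText.Extract(html);`.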
SLaks

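I wrote a simple method that may help. It collects every node with a given tag name, removing each one from the document as it goes. You can then read nodeCollection[i].InnerText to get each node's text.

    // Requires: using System.Collections.Generic; using HtmlAgilityPack;
    HtmlDocument hDoc = new HtmlDocument();
    List<HtmlNode> nodeCollection = new List<HtmlNode>();

    public void InitInstance(string htmlCode) {
        hDoc.LoadHtml(htmlCode);
        nodeCollection.Clear();
    }

    // Call as GetAllNodesInnerTextByTagName(hDoc.DocumentNode, "script").
    // Moves every matching element into nodeCollection and detaches it from the document.
    private void GetAllNodesInnerTextByTagName(HtmlNode node, string tagName) {
        if (null == node.ChildNodes) {
            return;
        } else {
            // The relative XPath matches direct children only;
            // SelectNodes returns null when nothing matches.
            HtmlNodeCollection nCollection = node.SelectNodes(tagName);
            if (null != nCollection) {
                for (int i = 0; i < nCollection.Count; i++) {
                    nodeCollection.Add(nCollection[i]);
                    nCollection[i].Remove();
                }
            }
            // Recurse into the remaining children to find deeper matches.
            nCollection = node.ChildNodes;
            if (null != nCollection) {
                for (int i = 0; i < nCollection.Count; i++) {
                    GetAllNodesInnerTextByTagName(nCollection[i], tagName);
                }
            }
        }
    }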
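
If the matching nodes do not need to be removed from the document, the same collection can probably be built without manual recursion. The sketch below is an alternative, not the method above: it relies on `Descendants(name)`, which already walks the whole subtree. The names `TagText` and `InnerTextsByTag` are hypothetical, and the tag name is assumed to be passed in lowercase.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TagText
{
    // Hypothetical helper: inner text of every element with the given
    // tag name, in document order. The document itself is left unchanged.
    public static List<string> InnerTextsByTag(string html, string tagName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
            .Descendants(tagName)       // all matching elements at any depth
            .Select(n => n.InnerText)   // text content of each match
            .ToList();
    }
}
```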
Leniel Maccaferri
tsingroo