I need to perform some logic on all the text nodes of an HTMLDocument. This is how I currently do it:

HTMLDocument pageContent = (HTMLDocument)_webBrowser2.Document;
IHTMLElementCollection myCol = pageContent.all;
foreach (IHTMLDOMNode myElement in myCol)
{
    foreach (IHTMLDOMNode child in (IHTMLDOMChildrenCollection)myElement.childNodes)
    {
        if (child.nodeType == 3)
        {
            // Do something with the text node
        }
    }
}

Since some of the elements in myCol also have children that are themselves in myCol, I visit some nodes more than once. Is there a better way to do this?

Cœur
nelshh

2 Answers

It might be best to iterate over the childNodes (direct children only) in a recursive function, starting at the top level, so that each node is reached exactly once through its parent. Something like:

// documentElement is the root <html> element of the mshtml document
IHTMLDOMNode htmlNode = (IHTMLDOMNode)pageContent.documentElement;
ProcessChildNodes(htmlNode);

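private void ProcessChildNodes(IHTMLDOMNode node)
{
    // childNodes needs the same cast as in the question before it can be enumerated
    foreach (IHTMLDOMNode childNode in (IHTMLDOMChildrenCollection)node.childNodes)
    {
        if (childNode.nodeType == 3) // 3 = text node
        {
            // ...
        }
        else
        {
            // Text nodes have no children, so only recurse into other nodes
            ProcessChildNodes(childNode);
        }
    }
}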
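
If very deep documents make recursion a concern, the same walk can probably be done with an explicit stack. This is only a sketch and I have not run it against a live browser control: TextNodeWalker and TextNodes are names made up for illustration, and it assumes the usual mshtml interop members (nodeType, lastChild, previousSibling).

```csharp
using System.Collections.Generic;
using mshtml;

static class TextNodeWalker
{
    // Hypothetical helper: yields each text node under root once, in document order.
    public static IEnumerable<IHTMLDOMNode> TextNodes(IHTMLDOMNode root)
    {
        var pending = new Stack<IHTMLDOMNode>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            IHTMLDOMNode node = pending.Pop();

            if (node.nodeType == 3) // 3 = text node
            {
                yield return node;
                continue;
            }

            // Push children last-to-first so the first child is popped first.
            for (IHTMLDOMNode child = node.lastChild; child != null; child = child.previousSibling)
            {
                pending.Push(child);
            }
        }
    }
}
```

Usage would then be something like `foreach (IHTMLDOMNode text in TextNodeWalker.TextNodes(htmlNode)) { ... }`, with htmlNode obtained as above.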
Steve

You could access all the text nodes in one shot using XPath in HTML Agility Pack.

I think this would work as shown, but I have not tried it out.

using HtmlAgilityPack;

HtmlDocument htmlDoc = new HtmlDocument();

// filePath is a path to a file containing the HTML
htmlDoc.Load(filePath);

// SelectNodes returns null (not an empty collection) when nothing matches
HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes("//text()");

if (coll != null)
{
    foreach (HtmlNode node in coll)
    {
        // do the work for a text node here
    }
}
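
If the markup is already in memory rather than in a file, LoadHtml can be used instead of Load. The sketch below assumes the HtmlAgilityPack package is referenced; TextNodeDemo and ExtractText are illustrative names, and skipping whitespace-only nodes is my own choice, not something the question asks for. With a live browser document you could presumably pass something like pageContent.documentElement.outerHTML as the input, but note that this works on a parsed copy, so changes made to these nodes would not show up in the page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class TextNodeDemo
{
    // Hypothetical helper: returns the text of every non-blank text node in the markup.
    public static List<string> ExtractText(string html)
    {
        var result = new List<string>();

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        // SelectNodes returns null when the XPath matches nothing.
        HtmlNodeCollection textNodes = htmlDoc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in textNodes)
        {
            // Skip nodes that contain only whitespace (indentation, line breaks).
            if (string.IsNullOrWhiteSpace(node.InnerText))
            {
                continue;
            }
            result.Add(node.InnerText);
        }

        return result;
    }

    static void Main()
    {
        foreach (string text in ExtractText("<p>Hello <b>world</b></p>"))
        {
            Console.WriteLine(text);
        }
    }
}
```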
Steve Townsend