I'm parsing an HTML DOM in C# with the HtmlAgilityPack library and would like to know how to traverse the DOM once I reach a specific element.

For example, when I get to the td with a class of "some-class", I want to go to the third sibling td and grab the href of its nested anchor.

<td class="some-class">Content I care about</td>
<td>Content I don't want</td>
<td>Content I don't want</td>
<td>    
    <a href="http://www.the-url-I-want.com">Some Amazing URL</a>
</td>

Currently, I'm landing at the td I want via:

foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//td"))
{
    HtmlAttribute nodeClass = node.Attributes["class"];

    if(nodeClass != null && nodeClass.Value == "some-class")
    {
        //Find the anchor that is 3 siblings away
        //Do something
    }
}

Does anyone know how I would use HtmlAgilityPack to grab the related anchor for each matching td?

Zach B
  • So far, this works (but feels ridiculously clunky)... `HtmlNode siblingAnchor = node.NextSibling.NextSibling.NextSibling.NextSibling.NextSibling.NextSibling.NextSibling.NextSibling.NextSibling.NextSibling.FirstChild.NextSibling;` – Zach B Sep 03 '14 at 22:28
  • This is actually reasonable code for getting a particular sibling node (if you can tolerate an occasional `NullReferenceException` when the HTML changes). I'd recommend reading a basic XPath tutorial so you can select elements more precisely and quickly: `"//"` has to search the whole tree for a match, and you can often narrow the search down to a particular subtree. You can also match attributes directly with XPath. – Alexei Levenkov Sep 03 '14 at 23:31
  • @AlexeiLevenkov The HTML should be static and if it changes, I'll be able to validate in the output of the program. I'll dig into XPath tutorials to see how to optimize. Thanks again for referring me to the HTMLAgilityPack to begin with, btw :) – Zach B Sep 03 '14 at 23:43
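
The long `NextSibling` chain in the comment above is needed because the whitespace between tags is itself a text node, so each `td` is more than one sibling away. As a rough sketch of a less brittle walk, the helper below skips everything that is not an element with the wanted name. The names `SiblingDemo` and `NthNextElement` are hypothetical, the `<table><tr>` wrapper is added only to make the fragment self-contained, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

static class SiblingDemo
{
    // Hypothetical helper: returns the n-th following sibling element
    // with the given tag name, or null if there are fewer than n.
    static HtmlNode NthNextElement(HtmlNode start, string name, int n)
    {
        for (HtmlNode cur = start.NextSibling; cur != null; cur = cur.NextSibling)
        {
            // Text and comment nodes are siblings too, so filter by type and name.
            if (cur.NodeType == HtmlNodeType.Element && cur.Name == name && --n == 0)
                return cur;
        }
        return null;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<table><tr>
<td class=""some-class"">Content I care about</td>
<td>Content I don't want</td>
<td>Content I don't want</td>
<td>
    <a href=""http://www.the-url-I-want.com"">Some Amazing URL</a>
</td>
</tr></table>");

        HtmlNode start = doc.DocumentNode.SelectSingleNode("//td[@class='some-class']");
        HtmlNode target = start == null ? null : NthNextElement(start, "td", 3);

        // "a" is a relative XPath: it selects a direct child anchor of the target td.
        HtmlNode anchor = target?.SelectSingleNode("a");
        Console.WriteLine(anchor?.GetAttributeValue("href", ""));
    }
}
```

With the sample markup this should end up at the anchor in the fourth `td`. Every step is null-checked, so a change in the HTML yields an empty line rather than an exception.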

1 Answer

Learn XPath and your job will be a lot easier. For example, to get the <td> element whose class attribute equals "some-class", we can use this XPath:

//td[@class='some-class']

And to get the third following sibling <td>, append:

/following-sibling::td[3]

So your loop can be rewritten as follows:

var xpath = "//td[@class='some-class']/following-sibling::td[3]/a";
var anchors = doc.DocumentNode.SelectNodes(xpath);
if (anchors != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode a in anchors)
    {
        // Do something with the anchor variable a
    }
}

BTW, a safer way to get an attribute value is the GetAttributeValue() method:

var href = a.GetAttributeValue("href", "");

The second argument is the default value that is returned when the attribute is not found.
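
Putting the pieces of this answer together, a minimal self-contained sketch might look like the following. It assumes the HtmlAgilityPack NuGet package is referenced. The class name `AnchorDemo`, the `<table><tr>` wrapper around the question's markup, and the lookup of a `title` attribute (which the sample anchor does not have) are added here purely for illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

static class AnchorDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<table><tr>
<td class=""some-class"">Content I care about</td>
<td>Content I don't want</td>
<td>Content I don't want</td>
<td>
    <a href=""http://www.the-url-I-want.com"">Some Amazing URL</a>
</td>
</tr></table>");

        var xpath = "//td[@class='some-class']/following-sibling::td[3]/a";
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(xpath);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (anchors == null)
        {
            Console.WriteLine("No matching anchor found.");
            return;
        }

        foreach (HtmlNode a in anchors)
        {
            // The attribute exists, so its value is returned.
            string href = a.GetAttributeValue("href", "");
            // The attribute is missing, so the supplied default is returned.
            string title = a.GetAttributeValue("title", "(no title)");
            Console.WriteLine(href + " " + title);
        }
    }
}
```

The explicit null check matters because iterating a null result with `foreach` would throw a `NullReferenceException`.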

har07
  • This will have a problem if the `td` has more than one class (`td class="one_class two_class"`). Apart from that, it is beautiful in its simplicity. http://stackoverflow.com/questions/1604471/how-can-i-find-an-element-by-css-class-with-xpath handles multiple classes. – Jon P Sep 04 '14 at 04:20
  • @JonP The post you linked contains useful info indeed. To keep things simple, my answer doesn't try to cover every possible problem unless it clearly shows up in the OP's sample markup. – har07 Sep 04 '14 at 04:32
  • I agree, keep it as simple as possible, for as long as possible. – Jon P Sep 04 '14 at 04:43
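
For readers who do hit the multi-class case raised in the comments above, a common XPath 1.0 idiom is to pad the class attribute with spaces and test for the padded token. The sketch below is only an illustration under assumptions: the class name `MultiClassDemo` and the extra `one_class` value in the markup are made up, and the HtmlAgilityPack NuGet package is assumed to be referenced.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

static class MultiClassDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        // Same shape as the question's markup, but the first td carries two classes.
        doc.LoadHtml(@"<table><tr>
<td class=""one_class some-class"">Content I care about</td>
<td>Content I don't want</td>
<td>Content I don't want</td>
<td>
    <a href=""http://www.the-url-I-want.com"">Some Amazing URL</a>
</td>
</tr></table>");

        // Exact comparison: matches only when the whole attribute is "some-class".
        string exact = "//td[@class='some-class']";

        // Token comparison: matches "some-class" anywhere in a space-separated list.
        string token = "//td[contains(concat(' ', normalize-space(@class), ' '), ' some-class ')]";

        HtmlNode exactHit = doc.DocumentNode.SelectSingleNode(exact);
        Console.WriteLine(exactHit == null ? "exact: no match" : "exact: match");

        HtmlNode a = doc.DocumentNode.SelectSingleNode(token + "/following-sibling::td[3]/a");
        Console.WriteLine(a == null ? "token: no match" : a.GetAttributeValue("href", ""));
    }
}
```

The padding with spaces keeps a class such as `some-class-wide` from matching by accident, which a bare `contains(@class, 'some-class')` would allow.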