HTML Agility Pack Node Selection

Question

I'm brand new to HTML Agility Pack (as well as network-based programming in general). I am trying to extract a specific line of HTML, but I don't know enough about HTML Agility Pack's syntax to understand what I'm not writing correctly (and am lost in their documentation). URLs here are modified.

        string html;
        using (WebClient client = new WebClient())
        {
            html = client.DownloadString("https://google.com/");
        }

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (HtmlNode img in doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a"))
        {
            Debug.Log(img.GetAttributeValue("href", null));
        }

        return null;

This is what the HTML looks like

<div id="ngg-image-3" class="ngg-gallery-thumbnail-box" >
    <div class="ngg-gallery-thumbnail">
            <a href="https://urlhere.png"
             // More code here
            </a>
    </div>
</div>

The problem occurs on the foreach line. I've tried matching examples online the best I can but am missing it. TIA.

Possible duplicate of [What is a NullReferenceException, and how do I fix it?](https://stackoverflow.com/questions/4660142/what-is-a-nullreferenceexception-and-how-do-i-fix-it) — SᴇM, May 17 '19 at 06:40
@SᴇM What do you mean? I know the issue lies in SelectNodes(). If I write SelectNodes("//img") instead, and img.GetAttributeValue("src", null)) it'll print a bunch of URLs. But I don't want all the images in the HTML, just a particular group. — tayusuki, May 17 '19 at 06:52
Protip: never mention `NullReferenceException` in your question anywhere if it's not actually about that; we get way too many inspecific questions about it and you'll get knee-jerk close votes. Your question is actually about how to select particular nodes with HTML Agility and why a particular `SelectNodes` call isn't returning any nodes. (And no, I don't know the answer.) — Jeroen Mostert, May 17 '19 at 07:01
@JeroenMostert Thanks for the heads up, I went ahead and revised it to hopefully be more clear. — tayusuki, May 17 '19 at 07:05
You should check whether `doc.DocumentNode.SelectNodes` is null before you do a foreach on it (that's where the exception is). TBH your HTML matches your XPath, but if you're getting the error, your HTML is most likely different from what you expect. Check for escape characters, etc. — Tyress, May 17 '19 at 07:07

score 0 · Answer 1 · answered May 22 '19 at 15:32

HTML Agility Pack (HAP) uses XPath syntax to query nodes: it parses the HTML document into a node tree that can be queried much like an XML document. So the trick is learning enough XPath to find the right combination of tags and attributes for the result you need.

The HTML snippet you pasted isn't well formed (there's no closing > on the anchor tag). Assuming that it is closed, then

//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a[@href]

will return an HtmlNodeCollection containing only those anchor tags that have an href attribute.

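If no nodes meet your criteria, SelectNodes returns null rather than an empty collection, so the foreach throws a NullReferenceException unless you check for null first.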
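A minimal sketch of a null-safe version of the loop, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (`ThumbnailLinks`, `Extract`) are hypothetical and only there for illustration. It returns the links instead of logging them, so it does not depend on Unity's `Debug.Log`.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HTML Agility Pack.
public static class ThumbnailLinks
{
    // Returns the href of every matching anchor, or an empty list when nothing matches.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when there is no match.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(
            "//div[@class='ngg-gallery-thumbnail-box']" +
            "//div[@class='ngg-gallery-thumbnail']//a[@href]");

        var hrefs = new List<string>();
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The second argument is the default used if the attribute is missing.
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }
}
```

Calling `ThumbnailLinks.Extract(html)` with the downloaded page then gives an empty list, rather than an exception, when the page contains no matching markup.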

For debugging, try logging the node count or OuterHtml of a less specific query to see what you're actually getting (this line will itself throw if the query matches nothing), e.g.

Debug.Log(doc.DocumentNode.SelectNodes("//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']")[0].OuterHtml);
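One way to narrow this down is to count matches for each step of the query in turn. The sketch below again assumes the HtmlAgilityPack package. `XPathProbe`, `Steps`, `CountMatches` and `Report` are hypothetical names, and a null result is treated as zero matches.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical debugging helper, not part of HTML Agility Pack.
public static class XPathProbe
{
    // The original query, split into progressively more specific steps.
    public static readonly string[] Steps =
    {
        "//div[@class='ngg-gallery-thumbnail-box']",
        "//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']",
        "//div[@class='ngg-gallery-thumbnail-box']//div[@class='ngg-gallery-thumbnail']//a[@href]"
    };

    // Counts matches for one XPath, treating a null result as zero.
    public static int CountMatches(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    // Builds one line per step, e.g. for passing to Debug.Log.
    public static List<string> Report(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();
        foreach (string xpath in Steps)
        {
            lines.Add(CountMatches(doc, xpath) + " match(es): " + xpath);
        }
        return lines;
    }
}
```

The first step whose count drops to zero shows where the downloaded HTML stops matching the query, which suggests the markup the server returns differs from what you expect at that level.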
