1

How can I parse the rel="canonical" link tag and read its URL from an HTML document?

I want to find the URL here:

<link rel="canonical" href="http://stackoverflow.com/questions/2593147/html-agility-pack-make-code-look-neat" />
Elvin

4 Answers

4

Suppose doc is your HtmlDocument object.

HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//link[@rel]");

should get you the link elements that have a rel attribute. Now iterate:

string url = null;
foreach (HtmlNode link in links)
{
    // Attributes[...] returns an HtmlAttribute, so compare its Value.
    if (link.Attributes["rel"].Value == "canonical") {
        url = link.Attributes["href"].Value;
    }
}

You can also filter in the SelectNodes call itself, so that only the links with rel="canonical" are returned: doc.DocumentNode.SelectNodes("//link[@rel='canonical']");
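A minimal sketch of that filtered query as a complete program, assuming the HtmlAgilityPack NuGet package is referenced. The helper name FindCanonicalUrl, the sample markup, and the example.com URL are made up for illustration and are not part of the question. Note that SelectNodes returns null rather than an empty collection when nothing matches, so the loop is guarded.

```csharp
using System;
using HtmlAgilityPack;

static class CanonicalExample
{
    // Hypothetical helper: returns the href of the first canonical link, or null.
    public static string FindCanonicalUrl(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The XPath predicate keeps only <link> elements whose rel is exactly "canonical".
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//link[@rel='canonical']");

        // SelectNodes yields null when no node matches.
        if (links == null)
        {
            return null;
        }

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue returns the supplied default (null) if href is absent.
            string url = link.GetAttributeValue("href", null);
            if (url != null)
            {
                return url;
            }
        }
        return null;
    }

    static void Main()
    {
        // Sample input; in practice this would be the downloaded page.
        string html = "<html><head><link rel=\"canonical\" href=\"http://example.com/page\" /></head><body></body></html>";
        Console.WriteLine(FindCanonicalUrl(html) ?? "(no canonical link found)");
    }
}
```

Returning null for "not found" keeps the caller in charge of what a missing canonical link should mean.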

Not tested code, but you get the idea :)

CyberDude
3

The attribute indexer returns an HtmlAttribute, not a string, so both the comparison and the assignment need .Value. Updated code is below:

var links = htmlDoc.DocumentNode.SelectNodes("//link[@rel]");

string canonical = null;

// SelectNodes returns null when no node matches, so guard the loop.
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        if (link.Attributes["rel"].Value == "canonical")
        {
            canonical = link.Attributes["href"].Value;
        }
    }
}
JMK
  • What is wrong with the accepted answer? It appears to still work? Possibly an issue with single vs double quotes maybe? – Moss Palmer Jul 08 '16 at 10:19
  • @MossPalmer This was a couple of months ago, but IIRC you need the **.Value** now – JMK Jul 08 '16 at 11:38
  • Ah yes. Thanks for clearing that up. I had added that without even looking. Might be worth noting that in your answer. Thanks. – Moss Palmer Jul 08 '16 at 11:39
  • I had thought the //link[@rel='canonical'] filter was the part that you were flagging as wrong, but this still works well. – Moss Palmer Jul 08 '16 at 11:40
  • How about canonical = link.GetAttributeValue("href", null); instead of canonical = link.Attributes["href"].Value;? The latter throws an exception if the href attribute is missing. – Ronald Ramos Apr 29 '17 at 21:03
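
The difference raised in the last comment can be sketched as follows, assuming the HtmlAgilityPack NuGet package is referenced. The helper name HrefOrNull and the sample markup (a canonical link with no href) are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

static class HrefLookupExample
{
    // Hypothetical helper: reads href without throwing when the attribute is absent.
    public static string HrefOrNull(HtmlNode link)
    {
        // link.Attributes["href"] is null when the attribute does not exist,
        // so link.Attributes["href"].Value would throw a NullReferenceException.
        // GetAttributeValue returns the supplied default instead.
        return link.GetAttributeValue("href", null);
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up input: a canonical link that is missing its href attribute.
        doc.LoadHtml("<link rel=\"canonical\">");

        HtmlNode link = doc.DocumentNode.SelectSingleNode("//link[@rel='canonical']");
        string canonical = HrefOrNull(link);

        Console.WriteLine(canonical == null ? "href missing" : canonical);
    }
}
```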
0
// Requires: using System.Linq; using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(_html);

// Take the href of the first <link rel="canonical"> that has one, or null if none exists.
String link = (from x in doc.DocumentNode.Descendants()
               where x.Name == "link"
               && x.Attributes["rel"] != null
               && x.Attributes["rel"].Value == "canonical"
               && x.Attributes["href"] != null
               select x.Attributes["href"].Value).FirstOrDefault();
Danilo Vulović
0

HtmlDocument html = new HtmlDocument(); html.LoadHtml(_html);

// SelectSingleNode returns null when nothing matches, hence the ?. operator.
var _canonical = html.DocumentNode.SelectSingleNode("//link[@rel='canonical']")?.GetAttributeValue("href", null);