How can I parse the rel="canonical" tag and its URL from an HTML document?
I want to extract the URL from this element:
<link rel="canonical" href="http://stackoverflow.com/questions/2593147/html-agility-pack-make-code-look-neat" />
Suppose doc is your HtmlDocument object.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//link[@rel]");
should get you the link elements that have a rel attribute. Now iterate:
string url = null;
foreach (HtmlNode link in links)
{
    // Attributes["..."] returns an HtmlAttribute, so compare and read its Value.
    if (link.Attributes["rel"].Value == "canonical")
    {
        url = link.Attributes["href"].Value;
    }
}
Also, it's possible to filter links in the SelectNodes call to only get the ones with "canonical": doc.DocumentNode.SelectNodes("//link[@rel='canonical']");
Not tested code, but you get the idea :)
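For completeness, here is a minimal sketch of the filtered-XPath variant wrapped in a helper. The class name CanonicalDemo, the method name GetCanonicalUrl, and the example.com sample markup are made up for illustration; the sketch assumes the HtmlAgilityPack NuGet package is referenced. It guards against SelectNodes returning null, which it does when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

public static class CanonicalDemo
{
    // Hypothetical helper: returns the canonical URL, or null when the page has none.
    public static string GetCanonicalUrl(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Let XPath do the filtering so only rel="canonical" links come back.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//link[@rel='canonical']");
        if (links == null)
        {
            return null;
        }

        foreach (HtmlNode link in links)
        {
            // GetAttributeValue falls back to the supplied default if href is missing.
            string href = link.GetAttributeValue("href", string.Empty);
            if (!string.IsNullOrEmpty(href))
            {
                return href;
            }
        }
        return null;
    }

    public static void Main()
    {
        // Sample markup invented for this sketch.
        const string sample =
            "<html><head><link rel=\"canonical\" href=\"http://example.com/page\" /></head><body></body></html>";
        Console.WriteLine(GetCanonicalUrl(sample) ?? "(no canonical link)");
    }
}
```

Returning null instead of throwing leaves it to the caller to decide what a missing canonical link means; adjust that if your pages are guaranteed to have one.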
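The accepted answer as originally posted does not compile: Attributes["rel"] returns an HtmlAttribute, not a string, so you have to read its Value property. Updated code is below: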
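var links = htmlDoc.DocumentNode.SelectNodes("//link[@rel]");
string canonical = null;
// SelectNodes returns null when nothing matches, so guard before iterating.
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        if (link.Attributes["rel"].Value == "canonical")
        {
            canonical = link.Attributes["href"].Value;
        }
    }
}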
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(_html);
String link = (from x in doc.DocumentNode.Descendants()
               where x.Name == "link"
                     && x.Attributes["rel"] != null
                     && x.Attributes["rel"].Value == "canonical"
                     && x.Attributes["href"] != null
               select x.Attributes["href"].Value).FirstOrDefault();
HtmlDocument html = new HtmlDocument();
html.LoadHtml(_html);
// SelectSingleNode returns null when no canonical link exists, hence the ?. operators.
var _canonical = html.DocumentNode.SelectSingleNode("//link[@rel='canonical']")?.Attributes["href"]?.Value;