I want to use the HTML Agility Pack to parse image (img src) and link (href) URLs from an HTML page, but I don't know much about XML or XPath. I have looked through help documents on many websites and still can't solve the problem. I am using C# in Visual Studio 2005. My English is not fluent, so I would sincerely appreciate it if someone could write some helpful code.
11
Also, can Html Agility Pack resolve relative paths? – iShow Jan 29 '11 at 08:30
6 Answers
26
The first example on the home page does something very similar, but consider:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
// Recent releases expose DocumentNode; the older DocumentElement no longer exists
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    string href = link.Attributes["href"].Value;
    // store href somewhere
}
So you can imagine that for img/@src, you just replace each a with img, and href with src.
You might even be able to simplify to:
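// SelectNodes returns element nodes rather than attributes, so match the
// elements and read the attribute that belongs to each tag
foreach (HtmlNode node in doc.DocumentNode
    .SelectNodes("//a[@href] | //img[@src]"))
{
    string attributeName = node.Name == "a" ? "href" : "src";
    list.Add(node.GetAttributeValue(attributeName, "")); // list: a List<string> declared earlier
}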
For relative URL handling, look at the Uri class.
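As a rough illustration of the Uri suggestion, here is a minimal sketch that turns an extracted href or src into an absolute URL. The class name LinkResolver, the method ToAbsolute, and the example.com addresses are made up for this example; only Uri and Uri.TryCreate come from the framework.

```csharp
using System;

public static class LinkResolver
{
    // Resolves href against the URL of the page it was found on.
    // Returns null when the two cannot be combined into a valid URI.
    // (LinkResolver and ToAbsolute are hypothetical names for this sketch.)
    public static string ToAbsolute(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl, UriKind.Absolute);
        Uri result;
        return Uri.TryCreate(baseUri, href, out result) ? result.AbsoluteUri : null;
    }
}
```

For instance, LinkResolver.ToAbsolute("http://example.com/dir/page.html", "../images/logo.png") gives http://example.com/images/logo.png, and an href that is already absolute comes back unchanged.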

Marc Gravell
I get an error: DocumentElement does not exist on the HtmlDocument object in HtmlAgilityPack version 1.4.0.0. Using DocumentNode instead: foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]")) { HtmlAttribute att = link.Attributes["href"]; – Kiquenet Apr 06 '12 at 12:44
7
The home page example and the accepted answer do not compile with the latest version. I tried something else:
private List<string> ParseLinks(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    // SelectNodes returns null (not an empty collection) when nothing matches
    var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
    if (nodes == null) return new List<string>();
    // Keep only the href of each link, not every attribute value (needs using System.Linq)
    return nodes.Select(n => n.GetAttributeValue("href", "")).ToList();
}
This works for me.
2
Maybe I am too late here to post an answer. The following worked for me:
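// MainImageNode is the HtmlNode of an <img> element selected earlier; yields null if it has no src
var MainImageString = MainImageNode.Attributes.FirstOrDefault(i => i.Name == "src")?.Value;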

Abhay Shiro
2
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
string name = htmlDoc.DocumentNode
.SelectNodes("//td/input")
.First()
.Attributes["value"].Value;

PhoenixRebirthed
0
You also need to take into account the document base URL element (<base>) and protocol-relative URLs (for example //www.foo.com/bar/).
For more information check:
- <base>: The Document Base URL element page on MDN
- The Protocol-relative URL article by Paul Irish
- What are the recommendations for html <base> tag? discussion on Stack Overflow
- Uri Constructor (Uri, Uri) page on MSDN
- Uri class doesn't handle the protocol-relative URL discussion on Stack Overflow
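To make the two points above concrete, here is a small sketch of how a <base> href and a protocol-relative link could be folded into URL resolution. BaseAwareResolver, Resolve, and its parameters are hypothetical names; the sketch assumes you have already read the <base> element's href (if any) from the parsed document and pass it in as a string. Protocol-relative links are handled explicitly by reusing the base scheme, rather than relying on how Uri treats a leading "//".

```csharp
using System;

public static class BaseAwareResolver
{
    // pageUrl:  absolute URL the document was downloaded from
    // baseHref: href of the <base> element, or null if the page has none
    // href:     the raw href/src value taken from the page
    // (All names here are illustrative, not part of any library.)
    public static Uri Resolve(string pageUrl, string baseHref, string href)
    {
        Uri baseUri = new Uri(pageUrl, UriKind.Absolute);

        // A <base href> may itself be relative, so resolve it against the page URL first
        if (!string.IsNullOrEmpty(baseHref))
            baseUri = new Uri(baseUri, baseHref);

        // Protocol-relative URL: reuse the scheme of the base explicitly
        if (href.StartsWith("//", StringComparison.Ordinal))
            return new Uri(baseUri.Scheme + ":" + href);

        return new Uri(baseUri, href);
    }
}
```

With a page at https://example.com/a/b.html and no <base>, the link //www.foo.com/bar/ resolves to https://www.foo.com/bar/; with a <base> of https://cdn.example.com/assets/, the link img/x.png resolves to https://cdn.example.com/assets/img/x.png.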

Leonid Vasilev
0
Late post, but here's a 2021 update to the accepted answer (it accounts for the API changes HtmlAgilityPack has made since then).
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
// The XPath below gets images.
// It is specific to a site. Yours will vary ...
string command = "//a[contains(concat(' ', @class, ' '), 'product-card')]//img";
List<string> listImages = new();
// SelectNodes returns null when nothing matches, so guard before iterating
var imageNodes = htmlDoc.DocumentNode.SelectNodes(command);
if (imageNodes != null)
{
    foreach (HtmlNode node in imageNodes)
    {
        // Using "data-src" below, but it may be "src" for you
        listImages.Add(node.GetAttributeValue("data-src", ""));
    }
}

mike g