How to get img/src or a/hrefs using Html Agility Pack?

Question

I want to use the Html Agility Pack to parse image and href links from an HTML page, but I don't know much about XML or XPath. I have looked through help documents on many websites and still can't solve the problem. I am using C# in Visual Studio 2005. I don't speak English fluently, so I would sincerely thank anyone who can write some helpful code.

And can Html Agility Pack resolve relative paths? – iShow Jan 29 '11 at 08:30

score 26 · Accepted Answer · answered Jan 29 '11 at 08:51

The first example on the home page does something very similar, but consider:

 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
 // newer releases (1.4.0.0 per the comments below) use doc.DocumentNode and link.Attributes["href"]
 foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
 {
    string href = link["href"].Value;
    // store href somewhere
 }

So you can imagine that for img@src, just replace each a with img, and href with src. You might even be able to simplify to:

 foreach(HtmlNode node in doc.DocumentElement
              .SelectNodes("//a/@href | //img/@src"))
 {
    list.Add(node.Value);
 }

For relative URL handling, look at the Uri class.
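
As a minimal sketch of that pointer, the Uri(Uri, string) constructor can resolve an extracted href or src value against the address of the page it came from. The class name RelativeUrls, the method ToAbsolute, and the example.com addresses are illustrative assumptions, not part of the original answer.

```csharp
using System;

// Hypothetical helper (not from the answer): turns an href/src value
// into an absolute URL, using the page address as the base.
public static class RelativeUrls
{
    public static string ToAbsolute(string pageUrl, string link)
    {
        Uri baseUri = new Uri(pageUrl, UriKind.Absolute);
        // Uri(Uri, string) applies standard relative-reference resolution;
        // a link that is already absolute comes back unchanged.
        Uri resolved = new Uri(baseUri, link);
        return resolved.AbsoluteUri;
    }
}
```

For example, RelativeUrls.ToAbsolute("http://example.com/docs/page.htm", "../img/logo.png") gives http://example.com/img/logo.png. A malformed link makes the constructor throw UriFormatException, so real code may prefer Uri.TryCreate.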

Marc Gravell

I get an error: DocumentElement does not exist on the HtmlDocument object in HtmlAgilityPack version 1.4.0.0. Using DocumentNode: foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]")) { HtmlAttribute att = link.Attributes["href"]; – Kiquenet Apr 06 '12 at 12:44
Not positive, but I believe DocumentElement is now DocumentNode. – James Hurley Apr 11 '23 at 17:14

score 7 · Answer 2 · edited Apr 04 '14 at 16:08

The home page example and the accepted answer are wrong: they don't compile with the latest version. I tried something else:

    private List<string> ParseLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null, not an empty collection, when nothing matches
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        // Collect only the href value of each anchor (the XPath guarantees it exists)
        return nodes == null
            ? new List<string>()
            : nodes.Select(a => a.Attributes["href"].Value).ToList();
    }

This works for me.

score 2 · Answer 3 · answered Sep 06 '16 at 09:58

Maybe I am too late here to post an answer. The following worked for me:

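// MainImageNode is assumed to be the HtmlNode of an <img> element.
// The result is an HtmlAttribute (or null if there is no src); read its Value for the URL.
var MainImageString = MainImageNode.Attributes.FirstOrDefault(i => i.Name == "src");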

Abhay Shiro

score 2 · Answer 4 · answered Apr 12 '19 at 15:18

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

string name = htmlDoc.DocumentNode
    .SelectNodes("//td/input")
    .First()
    .Attributes["value"].Value;

Source: https://html-agility-pack.net/select-nodes

PhoenixRebirthed

score 0 · Answer 5 · answered Apr 16 '18 at 09:16

You also need to take into account the document base URL element (<base>) and protocol-relative URLs (for example //www.foo.com/bar/).

For more information check:

<base>: The Document Base URL element page on MDN
The Protocol-relative URL article by Paul Irish
What are the recommendations for html <base> tag? discussion on StackOverflow
Uri Constructor (Uri, Uri) page on MSDN
Uri class doesn't handle the protocol-relative URL discussion on StackOverflow
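
The two cases above can be sketched with the Uri class alone. This is an illustration under assumptions: EffectiveUrls, Resolve, and the example addresses are made-up names, the baseHref argument stands for the href of the page's <base> element (null when the page has none), and protocol-relative links are handled by explicitly prefixing the scheme of the base rather than relying on how Uri parses them.

```csharp
using System;

// Hypothetical helper: resolves a link while honouring a <base href>
// and protocol-relative ("//host/path") links.
public static class EffectiveUrls
{
    public static string Resolve(string pageUrl, string baseHref, string link)
    {
        Uri pageUri = new Uri(pageUrl, UriKind.Absolute);

        // A <base href> may itself be relative, so resolve it against the page first.
        Uri baseUri = string.IsNullOrEmpty(baseHref)
            ? pageUri
            : new Uri(pageUri, baseHref);

        // Protocol-relative link: reuse the scheme of the effective base.
        if (link.StartsWith("//", StringComparison.Ordinal))
        {
            return new Uri(baseUri.Scheme + ":" + link).AbsoluteUri;
        }

        return new Uri(baseUri, link).AbsoluteUri;
    }
}
```

With a page at https://example.com/shop/index.html and a base href of /assets/, the link img/a.png resolves to https://example.com/assets/img/a.png; with no base element, //www.foo.com/bar/ resolves to https://www.foo.com/bar/.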

score 0 · Answer 6 · answered Dec 23 '21 at 00:32

Late post, but here's a 2021 update to the accepted answer (it accounts for the refactoring that HtmlAgilityPack made).

    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    // The XPath below gets images.
    // It is specific to a site. Yours will vary ...
    string command = "//a[contains(concat(' ', @class, ' '), 'product-card')]//img";
    List<string> listImages = new();
    foreach (HtmlNode node in htmlDoc.DocumentNode.SelectNodes(command))
    {
        // Using "data-src" below, but it may be "src" for you
        listImages.Add(node.Attributes["data-src"].Value);
    }
