28

I just downloaded the HTMLAgilityPack and the documentation doesn't have any examples.

I'm looking for a way to get all the images from a web page: the address strings (URLs), not the image files themselves.

<img src="blabalbalbal.jpeg" />

I need to pull the source of each img tag. I just want to get a feel for the library and what it can offer. Everyone said this was the best tool for the job.

Edit

public void GetAllImages()
{
    WebClient x = new WebClient();
    string source = x.DownloadString(@"http://www.google.com");

    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    document.Load(source);

    // I can't use the Descendants method. It doesn't appear.
    var ImageURLS = document.desc
                            .Select(e => e.GetAttributeValue("src", null))
                            .Where(s => !String.IsNullOrEmpty(s));
}
RAS
Sergio Tapia

2 Answers

50

You can do this using LINQ, like this:

var document = new HtmlWeb().Load(url);
var urls = document.DocumentNode.Descendants("img")
                                .Select(e => e.GetAttributeValue("src", null))
                                .Where(s => !String.IsNullOrEmpty(s));

EDIT: This code now actually works; I had forgotten to write document.DocumentNode.
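
If the HTML is already in a string (as in the question, where it comes from WebClient.DownloadString), the same query can be run after parsing the string with LoadHtml. This is only a minimal sketch, assuming HtmlAgilityPack is referenced and System.Linq is imported; the ImageScraper class and GetImageUrls method are hypothetical names used for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
public static class ImageScraper
{
    public static List<string> GetImageUrls(string html)
    {
        var document = new HtmlDocument();
        // LoadHtml parses markup held in a string;
        // Load expects a file path or a stream instead.
        document.LoadHtml(html);

        // Descendants lives on HtmlNode, so start from DocumentNode.
        return document.DocumentNode
                       .Descendants("img")
                       .Select(e => e.GetAttributeValue("src", null))
                       .Where(s => !String.IsNullOrEmpty(s))
                       .ToList();
    }
}
```

Calling ImageScraper.GetImageUrls(source) with the downloaded string would return the src values as they appear in the markup, so relative paths stay relative.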

SLaks
10

Based on their one example, but with modified XPath:

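 HtmlDocument doc = new HtmlDocument();
 List<string> image_links = new List<string>();
 doc.Load("file.htm");
 // DocumentNode is the root node; SelectNodes returns null when nothing matches.
 HtmlNodeCollection images = doc.DocumentNode.SelectNodes("//img");
 if (images != null)
 {
     foreach (HtmlNode link in images)
     {
         image_links.Add(link.GetAttributeValue("src", ""));
     }
 }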

I don't know this library, so I'm not sure how to write the list out somewhere else, but this will at least get you your data. (I may also not have declared the list correctly. Sorry.)

Edit

Using your example:

public void GetAllImages()
{
    WebClient x = new WebClient();
    string source = x.DownloadString(@"http://www.google.com");

    HtmlAgilityPack.HtmlDocument document = new HtmlAgilityPack.HtmlDocument();
    List<string> image_links = new List<string>();
    // LoadHtml parses a string of markup; Load expects a file path or stream.
    document.LoadHtml(source);

    // SelectNodes returns null when no node matches the XPath.
    HtmlNodeCollection images = document.DocumentNode.SelectNodes("//img");
    if (images != null)
    {
        foreach (HtmlNode link in images)
        {
            image_links.Add(link.GetAttributeValue("src", ""));
        }
    }
}
Anthony
    Make that: `List<string> image_links = new List<string>(); foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//img")) { image_links.Add( link.GetAttributeValue("src", "") ); }` – TaW Feb 23 '15 at 09:58