Image scraper with C#

Question

I'm trying to go through a web page's source code and add each <img src="http://www.dot.com/image.jpg"> element to an HtmlElementCollection. I then try to cycle through each element in the collection with a foreach loop and download the images by URL.

Here's what I have so far. My problem right now is that nothing is downloading, and I don't think my elements are being added properly by tag name. If they are, I can't seem to reference them for the download.

public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    public void button1_Click(object sender, EventArgs e)
    {
        string url = urlTextBox.Text;
        string sourceCode = WorkerClass.ScreenScrape(url);
        StreamWriter sw = new StreamWriter("sourceScraped.html");
        sw.Write(sourceCode);
    }

    private void button2_Click(object sender, EventArgs e)
    {
        string url = urlTextBox.Text;
        WebBrowser browser = new WebBrowser();
        browser.Navigate(url);
        HtmlElementCollection collection;
        List<HtmlElement> imgListString = new List<HtmlElement>();
        if (browser != null)
        {
            if (browser.Document != null)
            {
                collection = browser.Document.GetElementsByTagName("img");
                if (collection != null)
                {
                    foreach (HtmlElement element in collection)
                    {
                        WebClient wClient = new WebClient();
                        string urlDownload = element.FirstChild.GetAttribute("src");
                        wClient.DownloadFile(urlDownload, urlDownload.Substring(urlDownload.LastIndexOf('/')));
                    }
                }
            }
        }
    }
}

}

You're trying to go through a web page and add the... what? – Carey Gregory, Jun 15 '12 at 03:59
Check the urlDownload value for a valid path. – jac, Jun 15 '12 at 04:20

score 3 · Accepted Answer · answered Jun 15 '12 at 04:31

Once you call Navigate, you assume the document is ready to traverse and check for images, but in practice it takes some time to load. You need to wait until the document has finished loading.

Subscribe to the DocumentCompleted event on your browser object:

 browser.DocumentCompleted += browser_DocumentCompleted;

Implement the handler as:

static void browser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // The sender is the WebBrowser whose document has finished loading.
    WebBrowser browser = (WebBrowser)sender;
    if (browser.Document == null)
        return;

    // One WebClient is enough for all downloads; dispose it when done.
    using (WebClient wClient = new WebClient())
    {
        foreach (HtmlElement element in browser.Document.GetElementsByTagName("img"))
        {
            string urlDownload = element.GetAttribute("src");
            // Skip <img> elements without a usable src.
            if (string.IsNullOrEmpty(urlDownload))
                continue;

            // Name the file after the last URL segment, dropping the leading '/'
            // so it is saved relative to the current working directory.
            wClient.DownloadFile(urlDownload, urlDownload.Substring(urlDownload.LastIndexOf('/') + 1));
        }
    }
}

That's exactly what I did. It worked. I was just about to post my own answer! lol. – Keith, Jun 15 '12 at 05:03
Glad to hear that. Accept one of the answers, or post your own answer and accept it if your solution differs from this one. – Damith, Jun 15 '12 at 05:09
Sorry. I didn't notice there was a place to accept the answer. I'm new here. – Keith, Jun 15 '12 at 19:10

score 0 · Answer 2 · answered Jun 15 '12 at 04:13

Take a look at Html Agility Pack.

What you need to do is download and parse the HTML, and then process the elements you are interested in. It is a good tool for such tasks.
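
As a rough illustration of the "parse the HTML, then process the elements" approach, here is a minimal sketch using Html Agility Pack. The class name ImageSourceExtractor, the method GetImageUrls, and its pageUrl parameter are hypothetical names chosen for this example. It assumes the HtmlAgilityPack NuGet package is referenced and that you already have the page source as a string (for instance from the question's WorkerClass.ScreenScrape).

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class ImageSourceExtractor // hypothetical helper, not part of the library
{
    // Returns the absolute URL of every <img> element that has a src attribute.
    public static List<string> GetImageUrls(string html, Uri pageUrl)
    {
        var urls = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null)
            return urls;

        foreach (HtmlNode img in nodes)
        {
            string src = img.GetAttributeValue("src", "");
            Uri absolute;
            // Resolve relative paths such as "images/a.png" against the page URL.
            if (src.Length != 0 && Uri.TryCreate(pageUrl, src, out absolute))
                urls.Add(absolute.AbsoluteUri);
        }
        return urls;
    }
}
```

One thing to watch for in a Windows Forms project: both System.Windows.Forms and HtmlAgilityPack define a type named HtmlDocument, so a file that imports both namespaces needs to qualify the name.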

answered by nunespascal

score 0 · Answer 3 · answered Jun 15 '12 at 19:13

To anyone interested, here was the solution. It's exactly what Damith said. I found Html Agility Pack to be rather broken. That was the first thing I tried using. This ended up being a more viable solution for me and this is my final code.

    private void button2_Click(object sender, EventArgs e)
    {
        string url = urlTextBox.Text;
        WebBrowser browser = new WebBrowser();
        // Subscribe before navigating so the handler is in place when loading finishes.
        browser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(DownloadFiles);
        browser.Navigate(url);
    }

    private void DownloadFiles(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // The browser created in button2_Click is a local variable there,
        // so recover it here from the event sender.
        WebBrowser browser = (WebBrowser)sender;
        HtmlElementCollection collection;

        if (browser != null)
        {
            if (browser.Document != null)
            {
                collection = browser.Document.GetElementsByTagName("img");
                if (collection != null)
                {
                    foreach (HtmlElement element in collection)
                    {
                        string urlDownload = element.GetAttribute("src");
                        if (urlDownload != null && urlDownload.Length != 0)
                        {
                            // Dispose each WebClient once its download has finished.
                            using (WebClient wClient = new WebClient())
                            {
                                wClient.DownloadFile(urlDownload, "C:\\users\\folder\\location\\" + urlDownload.Substring(urlDownload.LastIndexOf('/')));
                            }
                        }
                    }
                }
            }
        }
    }
}

}
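
The code above names each file by taking everything after the last '/' in the image URL. That can misbehave when the URL carries a query string, ends in '/', or is not an http(s) address at all. The following is a sketch of one possible alternative built on System.Uri and System.IO.Path. ImageFileNames, GetLocalPath, and targetFolder are hypothetical names introduced for illustration, not part of the original solution.

```csharp
using System;
using System.IO;

static class ImageFileNames // hypothetical helper for illustration
{
    // Builds a local path inside targetFolder for the given image URL.
    // Returns null when the URL is not absolute http(s) or has no file name.
    public static string GetLocalPath(string imageUrl, string targetFolder)
    {
        Uri uri;
        if (!Uri.TryCreate(imageUrl, UriKind.Absolute, out uri))
            return null;

        // Ignore schemes such as data: or file: that cannot be named this way.
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return null;

        // LocalPath excludes the query string and fragment, so
        // "http://host/a/pic.jpg?size=2" yields "pic.jpg".
        string fileName = Path.GetFileName(uri.LocalPath);
        if (string.IsNullOrEmpty(fileName))
            return null;

        return Path.Combine(targetFolder, fileName);
    }
}
```

Inside the foreach loop, the download call could then become something like wClient.DownloadFile(urlDownload, path), guarded by a null check on the path returned from GetLocalPath. Two images with the same file name would still overwrite each other; this sketch does not handle that case.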
