How can I download an image from HTML documents with wildcards?

Question

I'm writing a C# program to pull a .jpg image from an HTML document, but the name of the target image changes every so often. Being a very new programmer, I cannot figure out how to achieve the desired result.

I am using WebClient to download the HTML.

So I guess I have a few questions to ask here:

How can I use a wildcard to match the image name when its exact name and length are unknown?
And how can I trim the surrounding HTML away from the target image in the document?

So you want every instance of the URL to the image/images? And then you want to go and download that image? — sealz, Aug 28 '13 at 20:10
Does the target image appear in a particular place in the HTML? Can you identify it by the surrounding characters? Have you looked at the string class to see how you might use those methods to locate and extract particular substrings? — Jim Mischel, Aug 28 '13 at 20:12
As Jim mentioned, I have done something like that: read the entire HTML in as a string, read until you hit the start of a link, read until you hit the image extension, then download it with WebClient. Continue moving through the file. If you show us what you have tried or what has you stuck, we can help you. — sealz, Aug 28 '13 at 20:14
It is identifiable by a path, so targeting its location in the HTML is not a problem. It's the name of the image that is troublesome. — DataHead, Aug 28 '13 at 20:22

score 2 · Answer 1 · answered Aug 28 '13 at 20:16

In short, using the approach you've described: you can't. HTTP requires that each individual requested resource be accessed by its name; you cannot ask an HTTP server to return a set of resources whose names match a pattern (be it a wildcard expression or a regex).

If, however, you know the names fall within a particular range and follow a pattern, then you could create a series of requests and handle 404 errors accordingly, like so:

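// Requires: using System; using System.Globalization; using System.IO; using System.Net;
// Note: WebRequest.Create needs an absolute URL, so prefix the site's scheme and host to this path.
String resource = "/images/aestheticallyAttractiveHumanFemalesWithoutClothing/img_{0}.jpg";
for(int i=1;i<100;i++) {

    String thisResource = String.Format(CultureInfo.InvariantCulture, resource, i);

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(thisResource);
    try {
        // GetResponse throws a WebException for 404 and other error statuses.
        using(HttpWebResponse response = (HttpWebResponse)request.GetResponse()) {
            if( response.StatusCode == HttpStatusCode.OK ) {
                using(Stream rs = response.GetResponseStream())
                using(FileStream fs = new FileStream(Path.Combine("C:\\Temp\\IRSTaxReturns2011", i.ToString(CultureInfo.InvariantCulture) + ".jpg"), FileMode.Create)) {
                    rs.CopyTo( fs );
                }
            }
        }
    } catch(WebException) {
        // No image with this number (or the request failed); move on to the next one.
    }
}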

score 2 · Answer 2 · edited May 23 '17 at 12:21

You should scrape the webpage to get the image URL, then download the image. For the scraping, check out:

https://github.com/jamietre/CsQuery

https://code.google.com/p/fizzler/

https://code.google.com/p/sharp-query/

Is there a jQuery-like CSS/HTML selector that can be used in C#?

These will allow you to select the element you care about based on attribute name, position in the document, or a combination of these identifiers, and then get its src attribute.

Download webpage html
Parse html to get the url of the image
Download the image
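
The three steps above could be sketched roughly as follows, using only the .NET base class library. This is an illustration under assumptions, not the method any of the linked libraries use: a regular expression stands in for a real HTML parser, the class and method names (ImageScraper, FindImageUrl, DownloadImage) are made up for the example, and the path prefix is whatever directory the asker says the image always lives under. The `[^"'/]+` part of the pattern plays the role of the "wildcard" for the changing file name.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class ImageScraper
{
    // Returns the first src value that starts with pathPrefix and ends in .jpg,
    // or null if the page contains no such attribute.
    public static string FindImageUrl(string html, string pathPrefix)
    {
        // [^"'/]+ matches a file name of any length (the "wildcard" part).
        string pattern = "src\\s*=\\s*[\"'](" + Regex.Escape(pathPrefix) + "[^\"'/]+\\.jpg)[\"']";
        Match match = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : null;
    }

    // pageUrl, pathPrefix and targetFile are all supplied by the caller.
    public static void DownloadImage(string pageUrl, string pathPrefix, string targetFile)
    {
        using (WebClient client = new WebClient())
        {
            // 1. Download the page HTML.
            string html = client.DownloadString(pageUrl);

            // 2. Parse the HTML to get the URL of the image.
            string src = FindImageUrl(html, pathPrefix);
            if (src == null)
                throw new InvalidOperationException("No matching image found.");

            // Resolve a relative src against the page address.
            Uri imageUri = new Uri(new Uri(pageUrl), src);

            // 3. Download the image.
            client.DownloadFile(imageUri, targetFile);
        }
    }
}
```

A call might look like ImageScraper.DownloadImage(pageUrl, "/images/daily/", "latest.jpg"), where "/images/daily/" and "latest.jpg" are placeholder values. A regex like this is brittle against unusual markup, which is why the parser libraries listed above are usually the safer choice.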

Edit: @Jacob Proffitt Cool stuff if you're OK with XPath:

http://htmlagilitypack.codeplex.com/

How to use HTML Agility pack
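
For the XPath route, a minimal sketch with HtmlAgilityPack might look like this. It assumes the HtmlAgilityPack package is referenced by the project; the class name AgilityPackExample and the idea of matching on a known path prefix are illustrative choices, not something taken from the library's documentation.

```csharp
using HtmlAgilityPack;

static class AgilityPackExample
{
    // Returns the src of the first <img> whose src starts with pathPrefix,
    // or null if there is no such element.
    // Assumption: pathPrefix contains no single quote, since it is pasted
    // into the XPath expression as a quoted literal.
    public static string FindImageUrl(string html, string pathPrefix)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // starts-with() does the "wildcard" work: only the known path is fixed.
        HtmlNode img = doc.DocumentNode.SelectSingleNode(
            "//img[starts-with(@src, '" + pathPrefix + "')]");

        return img == null ? null : img.GetAttributeValue("src", null);
    }
}
```

The returned value can then be passed to WebClient.DownloadFile in the same way as any other image URL, after resolving it against the page address if it is relative.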

I'd probably go with HtmlAgilityPack, but that's a small tweak to a solid answer... — Jacob Proffitt, Aug 28 '13 at 21:19
