I'm in a bit of a pickle. There is a list of images I want to grab from a website. I know how to do that much, but I need to filter the images by where they are located on the page.

For example, I want to grab the images in a div tag with the id "theseImages", but there is another set of images inside another div tag with the id "notTheseImages". Collecting every "img" tag into an HtmlElementCollection ignores the divs, so it also grabs the images from "notTheseImages".

Is there a way I could loop through the images while doing a check to see where those images are located in the div tags?

  • What are you using? Are you using WinForms with the WebBrowser component? If so, you could get the div itself and then loop through its child collection to get the images in question. Can you show some code and what you have tried so far? (I also am a bit of a pickle :D) – Icepickle May 06 '15 at 10:33
  • Please show some code – Ranjit Singh May 06 '15 at 10:34
  • Use CSQuery (which has jQuery style selectors) to easily separate out collections from a webpage's HTML. Best you show example HTML and code you have tried too. :) – iCollect.it Ltd May 06 '15 at 10:37
  • Code is required, but sounds like a job for [HtmlAgilityPack](https://htmlagilitypack.codeplex.com/) – anothershrubery May 06 '15 at 10:41
  • @Icepickle Yes, I'm using the WebBrowser component. This is what I'm using: HtmlElementCollection hec = webBrowser1.Document.GetElementsByTagName("img"); foreach(HtmlElement he in hec) { } For example HTML code:
    Doing what I did will ignore the two divs and just grab all 4 images.
    – hanahouhanah May 06 '15 at 10:52
  • @hanahouhanah I've answered a question on here which used csquery. Lets you use css/jquery style selectors on the html and was v. easy to use. http://stackoverflow.com/questions/22092208/parsing-html-with-csquery Since then I've used that. – hutchonoid May 06 '15 at 10:55

1 Answer

This could help you select elements from your current HTML, and maybe on future occasions too :)

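// Requires: using System; using System.Collections.Generic; using System.Linq; using System.Windows.Forms;
protected HtmlElement[] GetElementsByParent(HtmlDocument document, HtmlElement baseElement = null, params string[] singleSelectors)
{
    if (singleSelectors == null || singleSelectors.Length == 0)
    {
        throw new ArgumentException("Please give at least 1 selector!", "singleSelectors");
    }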
    IList<HtmlElement> result = new List<HtmlElement>();
    bool last = singleSelectors.Length == 1;
    string singleSelector = singleSelectors[0];
    if (string.IsNullOrWhiteSpace(singleSelector))
    {
        return null;
    }
    singleSelector = singleSelector.Trim();
    if (singleSelector.StartsWith("#"))
    {
        var item = document.GetElementById(singleSelector.Substring(1));
        if (item == null)
        {
            return null;
        }
        if (last)
        {
            result.Add(item);
        }
        else
        {
            var results = GetElementsByParent(document, item, singleSelectors.Skip(1).ToArray());
            if (results != null && results.Length > 0)
            {
                foreach (var res in results)
                {
                    result.Add(res);
                }
            }
        }
    }
    else if (singleSelector.StartsWith("."))
    {
        if (baseElement == null)
        {
            baseElement = document.Body;
        }
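        // Note: only the direct children of baseElement are checked, not deeper descendants.
        foreach (HtmlElement child in baseElement.Children)
        {
            string cls;
            // The WebBrowser control exposes the class attribute as "className"; "class" typically comes back empty.
            if (!string.IsNullOrWhiteSpace((cls = child.GetAttribute("className"))))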
            {
                if (cls.Split(' ').Contains(singleSelector.Substring(1)))
                {
                    if (last)
                    {
                        result.Add(child);
                    }
                    else
                    {
                        var results = GetElementsByParent(document, child, singleSelectors.Skip(1).ToArray());
                        if (results != null && results.Length > 0)
                        {
                            foreach (var res in results)
                            {
                                result.Add(res);
                            }
                        }
                    }
                }
            }
        }
    }
    else
    {
        HtmlElementCollection elements = null;

        if (baseElement != null)
        {
            elements = baseElement.GetElementsByTagName(singleSelector);
        }
        else
        {
            elements = document.GetElementsByTagName(singleSelector);
        }
        foreach (HtmlElement item in elements)
        {
            if (last)
            {
                result.Add(item);
            }
            else
            {
                var results = GetElementsByParent(document, item, singleSelectors.Skip(1).ToArray());
                if (results != null && results.Length > 0)
                {
                    foreach (var res in results)
                    {
                        result.Add(res);
                    }
                }
            }
        }
    }
    return result.ToArray();
}

private void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // here we can query
    var result = GetElementsByParent(webBrowser1.Document, null, "#theseImages", "img");
}

result would then contain the images that are under #theseImages.

Mind you, GetElementsByParent is fairly untested; I only tested it for your use case, where it seemed to work.

Don't forget to only start the query once you are sure the document is completed ;)
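For the exact case in the question, something shorter may already be enough. The following is only a minimal sketch and has not been run against your page: `ImageQueries`, `GetImagesInside` and `IsInside` are hypothetical names made up for illustration, not part of the WebBrowser API. It relies on `HtmlDocument.GetElementById`, `HtmlElement.GetElementsByTagName`, `HtmlElement.Parent` and `HtmlElement.Id`, and it assumes the div ids are unique in the document.

```csharp
using System.Collections.Generic;
using System.Windows.Forms;

static class ImageQueries
{
    // Hypothetical helper: returns the <img> elements nested anywhere
    // inside the element whose id is containerId.
    public static List<HtmlElement> GetImagesInside(HtmlDocument document, string containerId)
    {
        var images = new List<HtmlElement>();
        HtmlElement container = document.GetElementById(containerId);
        if (container == null)
        {
            return images; // no element with that id: empty result
        }
        // GetElementsByTagName on an element only searches below that element.
        foreach (HtmlElement img in container.GetElementsByTagName("img"))
        {
            images.Add(img);
        }
        return images;
    }

    // Hypothetical helper: answers "where is this image located?" by walking
    // up the Parent chain until an ancestor with the given id is found.
    public static bool IsInside(HtmlElement element, string containerId)
    {
        for (HtmlElement current = element.Parent; current != null; current = current.Parent)
        {
            if (current.Id == containerId)
            {
                return true;
            }
        }
        return false;
    }
}
```

From the DocumentCompleted handler you could then call `ImageQueries.GetImagesInside(webBrowser1.Document, "theseImages")`, or keep your existing loop over all img tags and skip every element for which `ImageQueries.IsInside(he, "notTheseImages")` returns true.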

Icepickle
  • You are welcome, and @hanahouhanah if it answers your question, then feel free to mark it as such as well :) – Icepickle May 10 '15 at 11:47