
I am developing an application that needs to get the images on the first page of a Google Images search. I have already figured out how to scrape the HTML of a Google search query, and how to open a URL, read the bytes of the photo, and save them as an Image object so I can display it on a Windows Form and save it to disk.

But since I am not that good at HTML parsing, or HTML in general, I would like a method that takes the HTML of the page and returns a list of strings containing the URLs of the images in that HTML. I would prefer the full-resolution photo URLs, but for now anything would do.

I have tried this solution, but when I use its top answer, `ndx` is -1. As far as I can tell, this is because Google has since changed its HTML and removed or renamed the `images_table` class?

This is the code from the answer linked above:

    private List<string> GetUrls(string html)
    {
        var urls = new List<string>();
        int ndx = html.IndexOf("class=\"images_table\"", StringComparison.Ordinal);
        ndx = html.IndexOf("<img", ndx, StringComparison.Ordinal);

        while (ndx >= 0)
        {
            ndx = html.IndexOf("src=\"", ndx, StringComparison.Ordinal);
            ndx = ndx + 5;
            int ndx2 = html.IndexOf("\"", ndx, StringComparison.Ordinal);
            string url = html.Substring(ndx, ndx2 - ndx);
            urls.Add(url);
            ndx = html.IndexOf("<img", ndx, StringComparison.Ordinal);
        }
        return urls;
    }

How can I reimplement this method so that it works as intended? I am using C#. If I did anything wrong with the question or its formatting, or if there is any information I still need to provide, please tell me, as I am new to programming and Stack Overflow. You can also suggest another (free) website or API I can use to get images from the web. Thanks in advance.
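
For reference, here is a minimal sketch of the kind of method being asked for, assuming plain string/regex scanning is acceptable. It does not rely on the `images_table` marker being present: if the marker is missing, it scans the whole document instead of passing -1 to `IndexOf`. The class name `ImageUrlExtractor`, the optional `marker` parameter, and the regular expression are illustrative assumptions, not part of the linked answer, and a regex is only an approximation of real HTML parsing.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is an assumption for illustration.
public static class ImageUrlExtractor
{
    // Matches the src attribute (double- or single-quoted) of an <img> tag.
    // The \s before "src" keeps attributes such as data-src from matching.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\ssrc\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)')",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the src values of all <img> tags found at or after the marker.
    // If marker is null, empty, or not found, the whole document is scanned.
    public static List<string> GetUrls(string html, string marker = null)
    {
        var urls = new List<string>();
        if (string.IsNullOrEmpty(html))
            return urls;

        int start = 0;
        if (!string.IsNullOrEmpty(marker))
        {
            int ndx = html.IndexOf(marker, StringComparison.Ordinal);
            if (ndx >= 0)
                start = ndx; // only narrow the search when the marker exists
        }

        foreach (Match m in ImgSrc.Matches(html, start))
        {
            // Attribute values may contain entities such as &amp;
            string url = WebUtility.HtmlDecode(m.Groups["url"].Value);
            if (url.Length > 0)
                urls.Add(url);
        }
        return urls;
    }
}
```

Calling `ImageUrlExtractor.GetUrls(html, "class=\"images_table\"")` returns an empty list for a page without any `<img>` tags rather than throwing. The returned strings can be ordinary links or embedded `data:` URIs, so the caller still has to tell the two apart.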

Spider Wings
  • Use a WebBrowser control to navigate to a Google Image search page and subscribe to the `[WebBrowser].DocumentCompleted` event, where you verify that `[WebBrowser].ReadyState == WebBrowserReadyState.Complete`. After that, you have all images in a single collection: `[WebBrowser].Document.Images`. `src` is an HtmlElement attribute, so `string src = [WebBrowser].Document.Images[0].GetAttribute("src");`. You'll find out that most of the thumbnails are stored in the HTML as Base64 strings. Use `Convert.FromBase64String()` to get the byte array. – Jimi Jun 06 '20 at 16:07
  • The link to the *real thing* is instead in an anchor two levels up. But you also have a `[WebBrowser].Document.Links` collection available that you can match. – Jimi Jun 06 '20 at 16:13
  • Errata corrige: with `WebBrowser control` I meant `WebBrowser class`: you don't need a Control, just the WebBrowser class object. You can also use [HtmlAgilityPack](https://html-agility-pack.net/); it provides similar functionality (probably too much for a simple task like this). BTW, if you decide to use a WebBrowser object, read the notes here: [How to get an HtmlElement value inside Frames/IFrames?](https://stackoverflow.com/a/53218064/7444103) (there's also an implementation, similar to yours). – Jimi Jun 06 '20 at 16:18
  • @Jimi While converting the Base64 string to a byte array, I am getting a `System.FormatException`. The message is: "The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters." – Spider Wings Jun 07 '20 at 12:46
  • Did you parse that string? When the image is in Base64 format, it starts with, e.g., `data:image/jpeg;base64,` in the case of a JPEG image. The Base64 string begins after the comma. Also, you can find both Base64-embedded images and standard HTTP/HTTPS links. You need to handle both cases. If you need an example, let me know. You didn't specify what tools you're using. – Jimi Jun 07 '20 at 13:18
  • Yes, I googled a little and figured out that was the issue. I am using `Substring` to parse it; let's see if that works. – Spider Wings Jun 07 '20 at 13:26
  • With `src` as the base64 string you parsed: `Image image = Image.FromStream(new MemoryStream(Convert.FromBase64String(src)));` – Jimi Jun 07 '20 at 17:55
  • Just replace `"class=\"images_table\""` in the first version of the code from the link you quoted with `"table class="` and it works fine. – Fredy Nov 23 '22 at 14:16
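
The two cases mentioned in the comments (an embedded `data:` URI versus an ordinary HTTP/HTTPS link) can be told apart with a small helper. The following is a minimal sketch, assuming the `src` value has already been extracted; the class and method names (`ImageSource`, `TryDecodeDataUri`) are hypothetical and chosen only for illustration. It strips everything up to and including the comma before calling `Convert.FromBase64String`, which avoids the `FormatException` described above.

```csharp
using System;

// Hypothetical helper; names are assumptions for illustration.
public static class ImageSource
{
    // Returns true and the decoded bytes when src is a base64 data URI
    // such as "data:image/jpeg;base64,....". Returns false for anything
    // else (for example an http/https link, which must be downloaded).
    public static bool TryDecodeDataUri(string src, out byte[] bytes)
    {
        bytes = null;
        if (src == null || !src.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
            return false;

        int comma = src.IndexOf(',');
        if (comma < 0)
            return false;

        // The header is the part before the comma, e.g. "data:image/jpeg;base64"
        string header = src.Substring(0, comma);
        if (!header.EndsWith(";base64", StringComparison.OrdinalIgnoreCase))
            return false;

        // The base64 payload begins right after the comma
        string payload = src.Substring(comma + 1).Trim();
        try
        {
            bytes = Convert.FromBase64String(payload);
            return true;
        }
        catch (FormatException)
        {
            return false; // payload was not valid base64
        }
    }
}
```

When the method returns true, the bytes can be turned into an image as in the comment above, with `Image.FromStream(new MemoryStream(bytes))`. When it returns false, the `src` value is treated as a normal link and downloaded instead.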
