
I am trying to get images from Google:

String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=audi&gws_rd=cr";
 org.jsoup.nodes.Document doc = Jsoup.connect(url).get();
 Elements elements = doc.select("div.isv-r.PNCib.MSM1fd.BUooTd");

The image data is base64-encoded, so to get the actual image URL I first read the data-id attribute of each result element. This part works:

 for (Element element : elements) {
     // attr() already returns the attribute value as a String
     String id = element.attr("data-id");

Then I need to make a new connection to url + "#imgrc=" + id:

org.jsoup.nodes.Document imgdoc = Jsoup.connect(url + "#imgrc=" + id).get();

In the browser, when I inspect the page, the data I need is inside <div jsname="CGzTgf">, so I select the same element in Jsoup:

   Elements images = imgdoc.select("div[jsname='CGzTgf']");
   // further steps

But images is always empty and I cannot find the error. I run this inside a new thread on Android. Any help will be appreciated.

aryanknp
  • Are you trying to download the images? I'm not clear on why you're looking in the div tags rather than the `a` -> `img src=` tag – Rob Evans Nov 10 '20 at 09:18
  • @RobEvans I am trying to get src attribute of the image – aryanknp Nov 10 '20 at 09:19
  • @RobEvans thats because img is present in the third child node of that div , directly I am also getting few top thumbnails which I dont need , also if i do directly it will give me base 64 encoded small dimension image – aryanknp Nov 10 '20 at 09:21
  • I've got it... Almost got a working example - just need to write the contents to a file – Rob Evans Nov 10 '20 at 09:27
  • @RobEvans , I found this https://stackoverflow.com/a/63926580/8719734 but this is not working in jsoup – aryanknp Nov 10 '20 at 10:25
  • I have the images but they're embedded Gifs.. I'm a little stuck trying to convert them back to files as theyr'e base64 encoded. Should be able to get a working solution but may take a little while – Rob Evans Nov 10 '20 at 10:26
  • Problem is that images is added through javascript and hence it remains empty , successfully wasted my time – aryanknp Nov 10 '20 at 11:50
  • Approach was wrong - I've provided a working solution now with an explanation. If it works for you pls give it a +1 and accept the answer so we both get the reputation points :) – Rob Evans Nov 10 '20 at 14:26
  • @RobEvans I already voted and will accept after checking your solution thanks – aryanknp Nov 10 '20 at 14:31
  • I attached an image as evidence it works :) – Rob Evans Nov 10 '20 at 14:34

1 Answer


It turns out that, the way you're doing it, you're looking in the wrong place entirely. The URLs are contained in a JavaScript <script> tag included in the response.
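As a side note on why the second request in the question cannot return anything different: the part of a URL after `#` is a fragment, which HTTP clients keep on the client side and do not send to the server. The sketch below only illustrates that split using `java.net.URI`; `SOME_ID` is a placeholder rather than a real data-id, and `requestTarget` and `fragmentOf` are hypothetical helper names.

```java
import java.net.URI;

public class FragmentDemo {

    // Path plus query: the part of the URL that goes on the HTTP request line.
    // The "#..." fragment is not included.
    static String requestTarget(String url) {
        URI uri = URI.create(url);
        String query = uri.getRawQuery();
        return uri.getRawPath() + (query == null ? "" : "?" + query);
    }

    // The fragment (text after '#'), or null if the URL has none.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String url = "https://www.google.com/search?tbm=isch&q=audi";
        String withFragment = url + "#imgrc=SOME_ID"; // SOME_ID is a placeholder

        System.out.println("fragment: " + fragmentOf(withFragment));
        System.out.println("same request target: "
                + requestTarget(url).equals(requestTarget(withFragment)));
    }
}
```

Both URLs produce the same request target, so the server sees the same request and returns the same HTML. In the browser it is the page's JavaScript that reacts to the fragment.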

I've extracted the <script> tags and filtered for the relevant ones (those with a nonce attribute).

I then filter those tags for one containing a specific function name used AND a generic search string I'm expecting to find (something that won't be in the other <script> tags).

Next, the value obtained needs to be stripped to get the JSON object containing about a hundred thousand arrays. I've then navigated this (manually) to pull out a subset of nodes containing the relevant URL nodes. I then filter this again to get a List<String> of the full URLs.
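The stripping step can be sketched in isolation. This is a minimal illustration, not the exact code from the answer: the sample string is invented and far smaller than the real payload, and `stripCallback` is a hypothetical helper that cuts at the last closing parenthesis instead of assuming the script ends in exactly `);`.

```java
public class CallbackStripper {

    static final String PREFIX = "AF_initDataCallback(";

    // Removes the "AF_initDataCallback(" prefix and everything from the
    // last ')' onwards, leaving only the argument passed to the call.
    static String stripCallback(String scriptBody) {
        String trimmed = scriptBody.trim();
        if (!trimmed.startsWith(PREFIX)) {
            throw new IllegalArgumentException("not an AF_initDataCallback script");
        }
        int end = trimmed.lastIndexOf(')');
        if (end < PREFIX.length()) {
            throw new IllegalArgumentException("no closing parenthesis found");
        }
        return trimmed.substring(PREFIX.length(), end);
    }

    public static void main(String[] args) {
        // Invented sample; the real script content is much larger
        String sample = "AF_initDataCallback({\"key\":\"ds:1\",\"data\":[1,[2,3]]});";
        System.out.println(stripCallback(sample));
    }
}
```

Cutting at the last `)` tolerates trailing whitespace or a missing semicolon. It still assumes the argument is something the JSON parser accepts.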

Finally I've reused some code from an earlier solution here: https://stackoverflow.com/a/63135249/7619034 with something similar to download images.

You'll then also get some console output detailing which URL ended up in which file id. Files are labeled image_[x].jpg regardless of actual format (so you may need to rework it a little - Hint: take file extension from url if provided).

import com.jayway.jsonpath.JsonPath;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

public class GoogleImageDownloader {

    private static final int TIMEOUT = 30000;
    private static final int BUFFER_SIZE = 4096;

    public static final String RELEVANT_JSON_START = "AF_initDataCallback(";
    public static final String PARTIAL_GENERIC_SEARCH_QUERY = "/search?q";

    public static void main(String[] args) throws IOException {
        String url = "https://www.google.com/search?site=imghp&tbm=isch&source=hp&q=audi&gws_rd=cr";
        Document doc = Jsoup.connect(url).get();

        // Response with relevant data is in a <script> tag
        Elements elements = doc.select("script[nonce]");

        String jsonDataElement = getRelevantScriptTagContainingUrlDataAsJson(elements);
        String jsonData = getJsonData(jsonDataElement);
        List<String> imageUrls = getImageUrls(jsonData);

        int fileId = 1;
        for (String urlEntry : imageUrls) {
            try {
                writeToFile(fileId, makeImageRequest(urlEntry));
                System.out.println(urlEntry + " : " + fileId);
                fileId++;
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    private static String getRelevantScriptTagContainingUrlDataAsJson(Elements elements) {
        String jsonDataElement = "";
        int count = 0;
        for (Element element : elements) {
            String jsonData = element.data();
            if (jsonData.startsWith(RELEVANT_JSON_START) && jsonData.contains(PARTIAL_GENERIC_SEARCH_QUERY)) {
                jsonDataElement = jsonData;
                // IF there are two items in the list, take the 2nd, rather than the first.
                if (count == 1) {
                    break;
                }
                count++;
            }
        }
        return jsonDataElement;
    }

    private static String getJsonData(String jsonDataElement) {
        String jsonData = jsonDataElement.substring(RELEVANT_JSON_START.length(), jsonDataElement.length() - 2);
        return jsonData;
    }

    private static List<String> getImageUrls(String jsonData) {
        // Reason for doing this in two steps is debugging is much faster on the smaller subset of json data
        String urlArraysList = JsonPath.read(jsonData, "$.data[31][*][12][2][*]").toString();
        List<String> imageUrls = JsonPath.read(urlArraysList, "$.[*][*][3][0]");
        return imageUrls;
    }

    private static void writeToFile(int i, HttpURLConnection response) throws IOException {
        // try-with-resources closes both streams even if reading or writing fails
        try (InputStream inputStream = response.getInputStream();
             FileOutputStream outputStream = new FileOutputStream("image_" + i + ".jpg")) {
            int bytesRead;
            byte[] buffer = new byte[BUFFER_SIZE];
            // copy the response body to the file one buffer at a time
            while ((bytesRead = inputStream.read(buffer)) != -1) {
                outputStream.write(buffer, 0, bytesRead);
            }
        }

        System.out.println("File downloaded");
    }

    // Could use JSoup here but I'm re-using this from an earlier answer
    private static HttpURLConnection makeImageRequest(String imageUrlString) throws IOException {
        URL imageUrl = new URL(imageUrlString);
        HttpURLConnection response = (HttpURLConnection) imageUrl.openConnection();
        response.setRequestMethod("GET");
        response.setConnectTimeout(TIMEOUT);
        response.setReadTimeout(TIMEOUT);
        response.connect();
        return response;
    }
}
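The hint above about taking the file extension from the URL could look something like the following. It is a rough sketch only: `extensionFromUrl` and `fileName` are hypothetical helpers, the `example.com` URLs are made up, and falling back to `jpg` when no usable extension is found is an assumption that mirrors the current naming.

```java
import java.net.URI;
import java.util.Locale;

public class ImageFileNames {

    // Returns the lower-cased extension of the last path segment of the URL,
    // or the fallback when the URL has no usable extension.
    static String extensionFromUrl(String imageUrl, String fallback) {
        String path;
        try {
            path = URI.create(imageUrl).getPath(); // query string and fragment are excluded
        } catch (IllegalArgumentException e) {
            return fallback; // URL could not be parsed
        }
        if (path == null) {
            return fallback;
        }
        String lastSegment = path.substring(path.lastIndexOf('/') + 1);
        int dot = lastSegment.lastIndexOf('.');
        if (dot < 0 || dot == lastSegment.length() - 1) {
            return fallback;
        }
        String ext = lastSegment.substring(dot + 1).toLowerCase(Locale.ROOT);
        // Accept only short alphanumeric extensions such as "png" or "jpeg"
        return ext.matches("[a-z0-9]{1,5}") ? ext : fallback;
    }

    static String fileName(int id, String imageUrl) {
        return "image_" + id + "." + extensionFromUrl(imageUrl, "jpg");
    }

    public static void main(String[] args) {
        System.out.println(fileName(1, "https://example.com/cars/audi.PNG?w=100"));
        System.out.println(fileName(2, "https://example.com/img?id=5"));
    }
}
```

In writeToFile, the file name would then come from something like fileName(fileId, urlEntry) instead of the hard-coded ".jpg" suffix. The extension in a URL is not guaranteed to match the actual image format, so checking the response's Content-Type may be more reliable.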

Partial Result I tested with:

enter image description here

I've used JsonPath for filtering the relevant nodes, which is good when you only care about a small portion of the JSON and don't want to deserialise the whole object. It follows a similar navigation style to DOM/XPath/jQuery navigation.

Apart from this one library and Jsoup, the libraries used are very bog standard.

Good Luck!

Rob Evans
  • thanks for your effort but that image will not have the same dimension as the orignal image from the url be it gif or any other format – aryanknp Nov 10 '20 at 10:34
  • **Eric Cartman** I've been through this one before. `ByteArrayInputStream bis = new ByteArrayInputStream(Base64.getDecoder().decode(base64EncodedImage));` – Y2020-09 Nov 10 '20 at 11:56
  • `ByteArrayOutputStream tmp = new ByteArrayOutputStream();` – Y2020-09 Nov 10 '20 at 11:58
  • `ImageIO.write(image, ext.extension, tmp);` – Y2020-09 Nov 10 '20 at 11:58
  • Look again @aryanagarwal - got it working by extracting URLs from the embedded Json. Presumably this works on any machine but let us know if not – Rob Evans Nov 10 '20 at 14:27
  • @Y2020-09 I tried something very similar to this and a number of variants - seemed all the embedded gifs were 1x1 pixel images. I'm not a front end dev so I don't really understand why this is done other than for tracking purposes, but getting the image back out didn't appear to work. – Rob Evans Nov 10 '20 at 14:28
  • @RobEvans thats my new google image search app , thanks a lot man! – aryanknp Nov 10 '20 at 15:15
  • @RobEvans hey man , i need your help this is no longer working now – aryanknp Apr 19 '21 at 09:40
  • @RobEvans this is getting empty images array , earlier it used to get all images , but i think google patched this up , please reply with @ – aryanknp Apr 21 '21 at 01:12
  • @aryanagarwal you may have been blocked? Its working fine my end - Can you try from another IP & machine? – Rob Evans Apr 21 '21 at 16:55
  • @RobEvans I am trying from other as well but it is not working in the morning i think but late night it works – aryanknp Apr 22 '21 at 03:01
  • Ok I have the same issue with it not working now. when getRelevantScriptTagContainingUrlDataAsJson is called, there are two items in the list starting with the text tag we're looking for. I'm assuming one is good, the other is bad. Since we stop at the first one, its probably the last one thats wanted. Not sure why 2 are being returned but its clearly a recent change. – Rob Evans Apr 23 '21 at 07:48
  • @aryanagarwal I have made a small change to the method and updated the code in the original answer to get the 2nd version of the interesting HTML if it exists. This may fix one part of the problem but break the other - its hard to know what the difference is without trying it at all times of the day. See how you get on. – Rob Evans Apr 23 '21 at 07:59