I'm trying to scrape the attributes of the search results from a Google Shopping query (HTML) with Jsoup. Before doing anything with the results, I wanted to make sure that I was actually getting the proper HTML from Jsoup, so I added a System.out.println(doc.toString()); to my AsyncTask to see what I was working with. As I suspected, the resulting HTML was not complete. Here is the code I was running, followed by its result:

(The search query is hard-coded to "scarf walmart" for testing purposes.)

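    import android.os.AsyncTask;
    import java.io.IOException;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class fetcher extends AsyncTask<Void, Void, Integer> {

        // Parsed page; stays null if the request fails.
        private Document doc;

        @Override
        protected Integer doInBackground(Void... voids) {
            try {
                // Fetch the Google Shopping results page with a desktop browser user agent.
                Connection.Response response = Jsoup.connect("https://www.google.ca/search?q=scarf+walmart&tbm=shop")
                        .ignoreContentType(true)
                        .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36")
                        .referrer("http://www.google.ca")
                        .timeout(12000)
                        .followRedirects(true)
                        .execute();

                doc = response.parse();

            } catch (IOException e) {
                e.printStackTrace();
            }

            return 1;
        }

        @Override
        protected void onPostExecute(Integer integer) {
            // Print the fetched HTML, guarding against a failed request.
            if (doc != null) {
                System.out.println(doc.toString());
            }
        }
    }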

This gave me some seemingly good results, except that I am only getting some of the 20+ search results on the first page. I suspect that this may have something to do with my userAgent value, but I'm not sure how I would go about fixing that.

Edit: Every time I run the app, a different number of search results shows up in the source code.

So my question is: how do I get all of the Google search results (or at least a consistent number of them) to show up when I fetch the HTML using Jsoup?

Any help is appreciated!

Update 2: I've experimented with my code and tried commenting out these lines:

    protected Integer doInBackground(Void... voids) {
        try{

            Connection.Response response= Jsoup.connect("https://www.google.ca/search?q=scarf+walmart&tbm=shop")
                    .ignoreContentType(true)
                    .userAgent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36")
                    //.referrer("http://www.google.ca")
                    //.timeout(100000)
                    //.followRedirects(true)
                    .execute();

            doc = response.parse();

        } catch (IOException e){
            e.printStackTrace();
        }

        return 1;
    }

I'm now certainly getting a LOT more results, but it's still skipping some. Any ideas?

Results update (out of 40 results):

Using the JavaScript interface (20 results):

09:40:11.909 18077-18077/com.painlessshopping.mohamed.findit I/System.out: >$12.97

09:40:11.909 18077-18077/com.painlessshopping.mohamed.findit I/System.out: >$12.97

09:40:11.910 18077-18077/com.painlessshopping.mohamed.findit I/System.out: >$19.97

09:40:11.910 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $12.97

09:40:11.910 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $19.97

09:40:11.910 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $29.97

09:40:11.910 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $29.97

09:40:11.911 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $14.97

09:40:11.911 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $7.97

09:40:11.911 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $7.97

09:40:11.911 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $12.97

09:40:11.911 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $12.97

09:40:11.912 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $12.97

09:40:11.912 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $16.97

09:40:11.912 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $19.97

09:40:11.912 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $16.97

09:40:11.912 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $14.97

09:40:11.913 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $12.97

09:40:11.913 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $14.97

09:40:11.913 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $14.97

Using the above code (ranges from ~20 to 36 results):

11-20 10:05:23.540 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $12.97 from Walmart.ca

11-20 10:05:23.540 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $12.97 from Walmart.ca

11-20 10:05:23.540 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $12.97 from Walmart.ca

11-20 10:05:23.541 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $19.97 from Walmart.ca

11-20 10:05:23.541 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $16.97 from Walmart.ca

11-20 10:05:23.541 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $29.97 from Walmart.ca

11-20 10:05:23.542 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $19.97 from Walmart.ca

11-20 10:05:23.542 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $19.99 from Walmart.ca

11-20 10:05:23.542 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $29.97 from Walmart.ca

11-20 10:05:23.543 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $19.97 from Walmart.ca

11-20 10:05:23.543 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $16.97 from Walmart.ca

11-20 10:05:23.544 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $39.97 from Walmart.ca

11-20 10:05:23.545 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $14.97 from Walmart.ca

11-20 10:05:23.545 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $12.97 from Walmart.ca

11-20 10:05:23.546 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $7.97 from Walmart.ca

11-20 10:05:23.546 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $14.97 from Walmart.ca

11-20 10:05:23.547 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $29.98 from Walmart.ca

11-20 10:05:23.547 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $7.97 from Walmart.ca

11-20 10:05:23.547 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $14.97 from Walmart.ca

11-20 10:05:23.548 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $12.97 from Walmart.ca

11-20 10:05:23.548 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $7.97 from Walmart.ca

11-20 10:05:23.548 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $7.97 from Walmart.ca

11-20 10:05:23.549 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $7.97 from Walmart.ca

11-20 10:05:23.549 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $9.97 from Walmart.ca

11-20 10:05:23.550 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $12.97 from Walmart.ca

11-20 10:05:23.550 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $12.98 from Walmart.ca

11-20 10:05:23.550 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $22.97 from Walmart.ca

11-20 10:05:23.551 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $6.87 from Etsy - ashton11
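
To make the two runs easier to compare, the log lines above could be tallied per price instead of read by eye. This is only a minimal sketch in plain Java (no Jsoup or Android needed): the class name PriceTally and the method tally are hypothetical, and it assumes each log line contains at most one price of the form $12.97.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the app: tallies prices found in logcat lines.
final class PriceTally {

    // Matches a dollar amount such as $12.97 anywhere in a line.
    private static final Pattern PRICE = Pattern.compile("\\$\\d+\\.\\d{2}");

    // Returns a map from price text to the number of lines containing it,
    // sorted lexicographically by the price text. Lines without a price are ignored.
    static Map<String, Integer> tally(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = PRICE.matcher(line);
            if (m.find()) {
                counts.merge(m.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A few lines copied from the output above.
        List<String> sample = List.of(
                "09:40:11.909 18077-18077/com.painlessshopping.mohamed.findit I/System.out: >$12.97",
                "09:40:11.910 18077-18077/com.painlessshopping.mohamed.findit I/System.out: $19.97",
                "11-20 10:05:23.540 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $12.97 from Walmart.ca",
                "11-20 10:05:23.551 16788-16855/com.painlessshopping.mohamed.findit I/System.out: $6.87 from Etsy - ashton11");
        System.out.println(tally(sample));
    }
}
```

Running tally once on each run's output and comparing the two maps would show which prices appear in one run but not the other. Matching on price alone is a simplification, since several distinct products share the same price.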

Yamaha T64312
  • Likely JavaScript related: jsoup doesn't provide JavaScript support. On Android, using a WebView to render the page and then passing the resulting HTML source to jsoup with a JavascriptInterface works well. Compare: http://stackoverflow.com/a/39174441/1661938 – Frederic Klein Nov 19 '16 at 19:49
  • I tried that already and it didn't help :/ – Yamaha T64312 Nov 20 '16 at 00:47
  • Also, I need to display information about the results in a different format, so having a WebView there would not be ideal – Yamaha T64312 Nov 20 '16 at 00:48
  • Have you read the linked solution for Android? You don't have to display the WebView; it is just used to provide JavaScript-rendered HTML. You might additionally need to simulate scrolling in the WebView with JavaScript, since other services (like images.google) do lazy loading triggered by scroll events. – Frederic Klein Nov 20 '16 at 08:38
  • I read through the solution and implemented it in a test version of the app, and now I get even fewer results. On top of that, the results I do get don't include the name or details of the search item, only the price and brand. I don't know if this is the issue, but keep in mind I'm using Google Shopping. @FredericKlein – Yamaha T64312 Nov 20 '16 at 14:44
  • I'll post a result comparison with my method vs your method – Yamaha T64312 Nov 20 '16 at 14:58
