import java.io.IOException;
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;


public class listGrabber {
    public static void main(String[]args) {
        try {
            Document doc = Jsoup.connect("https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free").get();
            int count = 0;
            Elements elements;
            String url;
            ArrayList<String> list = new ArrayList<>();
            do{
                elements = doc.select("a[class^=title]").get(count).select("a[class^=title]");

                url = "";
                url = elements.attr("abs:title").replaceAll("https://play.google.com/store/apps/category/GAME_ACTION/collection/","");
                url = url.replaceAll("®|™","");
                url = url.replaceAll("[(](.*)[)]","");
                list.add(url);
                System.out.println(url);
                count++;
            } while (url != null && !url.isEmpty()); // compare string contents, not references
            // String divContents =
            // doc.select(".id-app-orig-desc").first().text();
            // elements.remove("div");
        } catch (IOException e) {
            e.printStackTrace(); // report connection failures instead of silently ignoring them
        }
    }
}

As you can see above, I'm trying to grab a list of words from https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free

The Google Play Store page loads more elements every time you scroll to the bottom of the page.

My program grabs the first 40 or so elements that show up, but since jsoup doesn't load the rest of the page, which is loaded dynamically, I can't grab any elements beyond those first 40.

Furthermore, if you scroll down the page to game #300, a Show More button appears. I'd also like to parse the elements beyond the Show More button.

Is there any way for jsoup to parse all the elements that would load dynamically on the page?

Vic
  • No. Extra elements are loaded by JavaScript, and JSoup does not support executing JavaScript. – yole Jul 27 '15 at 19:55
  • @yole Is there another library I can use instead of Jsoup that might work? – Vic Jul 27 '15 at 19:56
  • You could figure out the ajax call the web page makes when you scroll to the bottom of the page and then make the call yourself repeatedly with any arguments it requires. That should work as long as you have a decent url to work with. – Robert Moskal Jul 27 '15 at 20:42
  • Or use selenium web driver. – Alkis Kalogeris Jul 28 '15 at 07:08
  • Possible duplicate of [Page content is loaded with javascript and Jsoup doesn't see it](http://stackoverflow.com/questions/7488872/page-content-is-loaded-with-javascript-and-jsoup-doesnt-see-it) – Vic Seedoubleyew Aug 19 '16 at 21:55

1 Answer


EDIT: After a few comments from the OP, I understood exactly what he wants to achieve. I've changed my original solution a bit and tested it.

You can do it with jsoup. After the first page, getting the next one requires you to send a POST request with some request parameters (the `data(...)` calls in the code below). The parameters contain (among others) the start number and how many records to get. If you send an illegal number (i.e. you ask for the page that contains game number 700 but the results contain only 600 games), you get the first page again. You can loop through the pages until you get a result that you already have.
Sometimes the server returns 600 results and sometimes only 540; I could not figure out why.
The code for that is:

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class HelloWorld {

public static void main(String[] args) {

    Connection.Response res = null;
    Document doc = null;
    Boolean OK = true;
    int start = 0;
    String query;
    ArrayList<String> tempList = new ArrayList<>();
    ArrayList<String> games = new ArrayList<>();
    Pattern r = Pattern.compile("title=\"(.*)\" a");

    try {   //first connection with GET request
        res = Jsoup.connect("https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free")
                .method(Method.GET)
                .execute(); 
        doc = res.parse();
    } catch (Exception ex) {
        //Do some exception handling here
    }
    for (int i=1; i <= 60; i++) {    //parse the result and add it to the list
        query = "div.card:nth-child(" + i + ") > div:nth-child(1) > div:nth-child(3) > h2:nth-child(2) > a:nth-child(1)";
        tempList.add(doc.select(query).toString());
    }

    while (OK) {    //loop until you get the same results again
        start += 60;    
        System.out.println("now at number " + start);
        try {      //send post request for each new page
            doc = Jsoup.connect("https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free?authuser=0")
                    .cookies(res.cookies())
                    .data("start", String.valueOf(start))
                    .data("num", "60")
                    .data("numChildren", "0") 
                    .data("ipf", "1")
                    .data("xhr", "1")
                    .post();
        } catch (Exception ex) {
            //Do some exception handling here
        }
        for (int i=1; i <= 60; i++) {    //parse the result and add it to the list
            query = "div.card:nth-child(" + i + ") > div:nth-child(1) > div:nth-child(3) > h2:nth-child(2) > a:nth-child(1)";
            if (!tempList.contains(doc.select(query).toString())) {
                tempList.add(doc.select(query).toString());
            } else {    //we've seen these games before, time to quit
                OK = false;
                break;
            }               
        }   
    }
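    for (int i = 0; i < tempList.size(); i++) {    //remove all redundant info, keep only the title
        Matcher m = r.matcher(tempList.get(i));
        if (m.find()) {
            games.add(m.group(1));
            //number by games.size(): indexing games with i breaks once an entry fails to match
            System.out.println(games.size() + " " + m.group(1));
        }
    }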
}
}

The code can be further improved (for example, by handling all the lists in a separate method), so that's up to you.
I hope this does the job for you.
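
As one possible shape for that separate method, here is a minimal sketch of the paging loop on its own, with the HTTP part abstracted away. The names `PagedCollector`, `collectAll`, `fetchPage` and `fakePage` are hypothetical and not part of jsoup; `fakePage` only imitates the behaviour described above (an out-of-range `start` returns the first page again). Note that this sketch uses a slightly different stop rule than the code above: it stops when a page contributes nothing new, rather than at the first repeated title, so a title that legitimately appears twice does not end the loop early.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

class PagedCollector {

    // Collects titles page by page. fetchPage receives the start offset and
    // returns the titles of that page (in real use: the POST request plus parsing).
    static List<String> collectAll(IntFunction<List<String>> fetchPage, int pageSize) {
        Set<String> seen = new LinkedHashSet<>();   // keeps order, drops duplicates
        for (int start = 0; ; start += pageSize) {
            int before = seen.size();
            seen.addAll(fetchPage.apply(start));
            if (seen.size() == before) {
                break;  // empty page or a repeat of earlier pages: nothing left to fetch
            }
        }
        return new ArrayList<>(seen);
    }

    // Stand-in for the server (an assumption for this demo): 'total' items overall,
    // and an out-of-range start falls back to the first page.
    static List<String> fakePage(int start, int pageSize, int total) {
        int from = start < total ? start : 0;
        List<String> page = new ArrayList<>();
        for (int i = from; i < Math.min(from + pageSize, total); i++) {
            page.add("Game " + i);
        }
        return page;
    }

    public static void main(String[] args) {
        List<String> all = collectAll(start -> fakePage(start, 60, 130), 60);
        System.out.println(all.size());  // prints 130
    }
}
```

In real use, the lambda passed to `collectAll` would wrap the `Jsoup.connect(...).post()` call and the selector loop from the code above.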

TDG
  • Why did you use Connection.Response as opposed to Document though, is it cause Document can't be modified but Connection.Response allows for POST requests? Also, what does GetWords do? And why is .cookies(res.cookies()) necessary in this, does it just prevent the connection from refreshing? – Vic Jul 29 '15 at 03:49
  • Getting this kind of output with this method: http://i.imgur.com/3YozLin.png I changed up some things a little but I get the same amount of values in the arraylist regardless (523, 522 without duplicates) The number displayed is the amount of values in the array at each point in time. The first and 61st value as well as the 60th and 120th value are both in the arraylist so I'm quite lost as to which ones aren't being placed into the list. – Vic Jul 29 '15 at 04:27
  • `GetWords` is the method for filtering the results. Maybe I copied it wrong from your question. I put it in a separate method because it makes more sense to separate tasks. I use `Response` for the first `GET` request and then `connect` for the `POST`s - I think I saw it in some tutorial. As for the results you get - you didn't explain in your question what exactly you are trying to achieve, so I can't tell you what's wrong. – TDG Jul 29 '15 at 06:52
  • I'm attempting to grab every single game name on the google play store page, it totals to either 540 or 600 games(not sure why it varies). Using the method you wrote I ended up with 522 so I end up missing 18 and it seems that i.imgur.com/3YozLin.png (second number is list.size at each point in time) it's not missing 18 in one chunk but rather missing some in each section that it parses through – Vic Jul 29 '15 at 15:24
  • Thanks so much for the code, taking quite a while to study through it. I can't seem to understand where the value for query comes from and what the value "div.card:nth-child(" + i + ") > div:nth-child(1) > div:nth-child(3) > h2:nth-child(2) > a:nth-child(1)"; means though. Why did you use pattern and match this time instead of doc.select().get? – Vic Jul 30 '15 at 00:46
  • `i` varies from 1 to 60, so we get the query `div.card:nth-child(i) > div:nth-child(1) > div:nth-child(3) > h2:nth-child(2) > a:nth-child(1)`. I got this value from the browser's developer tools - they have an `inspector` which shows each element's query string. I find it very convenient to use that tool, but if you have another query that works for you, that's fine. PS - if my solution works for you, please mark the answer as "accepted". – TDG Jul 30 '15 at 04:33
  • what do you mean by inspector? Can't seem to find the function :/ – Vic Aug 01 '15 at 16:49
  • It's not a `jsoup` function... It's part of the browser. Press `F12` and you will open `developer tools`. It shows lots of information about the displayed page. It has an `inspector` - you choose an element from the page, and it can show you the `query` needed to get that element. – TDG Aug 02 '15 at 15:57
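
On the question of what the `Pattern`/`Matcher` step does: the selector returns the whole `<a ...>` element, and the regex then pulls the `title` attribute out of its HTML string. Below is a minimal, jsoup-free sketch of that step. The class `TitleExtractor`, its methods and the sample HTML are made up for illustration. It uses `[^"]*` instead of the greedy `(.*)` so the match stops at the first closing quote, and it applies the same cleanup as the question's code (plus a `trim()`, which is an addition). Reading the attribute directly with jsoup's `attr("title")` on the selected element, as the question's code does with `attr(...)`, would likely be an alternative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {

    // Captures the value of a title="..." attribute; [^"]* cannot run past the closing quote.
    private static final Pattern TITLE = Pattern.compile("title=\"([^\"]*)\"");

    // Returns the title attribute value, or null when the HTML has none.
    static String extractTitle(String anchorHtml) {
        Matcher m = TITLE.matcher(anchorHtml);
        return m.find() ? m.group(1) : null;
    }

    // Same cleanup as in the question: drop the registered and trademark signs
    // and any parenthesised part; trim() is added here to remove leftover spaces.
    static String clean(String title) {
        return title.replaceAll("\u00AE|\u2122", "")
                    .replaceAll("\\(.*\\)", "")
                    .trim();
    }

    public static void main(String[] args) {
        // Hypothetical anchor markup, only shaped like what the selector returns.
        String html = "<a class=\"title\" href=\"/store/apps/details?id=x\" "
                    + "title=\"Some Game\u2122 (Free)\" aria-hidden=\"true\">Some Game</a>";
        System.out.println(clean(extractTitle(html)));  // prints Some Game
    }
}
```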