
Today I started playing with Jsoup. I wanted to see how powerful Jsoup is, so I looked for a webpage with a lot of elements and tried to retrieve all of them. I found what I was looking for: http://www.top1000.ie/companies.

The page contains a list of 1000 similar elements (one per company); only the text inside each one changes, and that text is what I am trying to retrieve. However, I am only able to get the first 20 elements, not the rest.

This is my simple code:

package retrieveInfo;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Retrieve {

    public static void main(String[] args) throws Exception{
        String url = "http://www.top1000.ie/companies";
        Document document = Jsoup.connect(url)
                 .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                 .timeout(1000*5)
                 .get();

        Elements companies = document.body().select(".content .name");
        for (Element company : companies) {
            System.out.println("Company: " + company.text());
        }
    }

}

I thought the page might not have had time to load, which is why I added .timeout(1000*5) to wait 5 seconds, but I still only get the first 20 elements of the list.

Does Jsoup have a limit on the number of elements you can retrieve from a webpage? I don't think it should, since it seems designed for exactly this purpose, so I assume I am missing something in my code.

Any help would be appreciated. Thanks in advance!

Francisco Romero

2 Answers


NEW ANSWER:

I looked at the website you are trying to parse. The problem is that only the first 20 companies are loaded with the first call to the site; the rest is loaded via AJAX, and Jsoup does not interpret or run JavaScript. You can use Selenium WebDriver for that, or figure out the AJAX calls directly.

OLD:

Jsoup limits the response body to 1MB unless told otherwise via the maxBodySize() method. So you may want to do this:

Document document = Jsoup.connect(url)
             .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
             .maxBodySize(0)
             .timeout(1000*5)
             .get();

Beware, the above turns off the size limit altogether. This may not be a good idea, since Jsoup builds the DOM in memory, so you may run into problems with memory heap size for big documents. If you do have problems like this, it may help to switch to a SAX-based HTML parser.

luksch
  • Why do you set it to 0? I also tried adding 2MB to `maxBodySize()` before, but I only get the same first 20 elements, also with your solution. – Francisco Romero Apr 19 '16 at 15:04
  • Furthermore, to be honest, I do not think that 20 phrases could occupy 1MB. – Francisco Romero Apr 19 '16 at 15:08
  • Please see my revised answer. – luksch Apr 19 '16 at 15:16
  • Thank you for the additional info. Now I can see from both answers that the problem is the AJAX (I did not notice it before). Which library would be best for this? – Francisco Romero Apr 19 '16 at 15:25
  • The easiest is probably the way @nyname00 describes: using the POST requests with a page parameter will give you the JSON answers with the embedded HTML. After parsing the JSON, you can probably still feed the html property into Jsoup to parse the content you are interested in. – luksch Apr 19 '16 at 15:33
  • Thank you! I will take a look about it. – Francisco Romero Apr 19 '16 at 15:43
  • I do not like people who say "please upvote my answer", even when theirs is the best and the OP has not accepted it yet. But in this case, who cares; you also helped me look into this. – Francisco Romero Apr 21 '16 at 10:26

The site initially loads only the first 20 elements. When you scroll down, the next block of elements is loaded by a script (a POST to http://www.top1000.ie/companies?page=2), which then adds the received elements to the DOM.

However, the response you get from a POST to /companies?page= is JSON:

{
 "worked":true,
 "has_more":true,
 "next_url":"/companies?page=3",
 "html":"..."
 ...
}

Here the "html" field seems to contain the elements that will be added to the DOM.

Using Jsoup to get the data directly will be tedious, because Jsoup will wrap the JSON in HTML tags and will also escape certain characters.
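If you want to avoid pulling in a JSON library at all, you can also extract the "html" field from the raw response with plain string handling. This is a fragile, stdlib-only sketch (the `extractHtml` helper and the assumption that the field appears exactly as `"html":"..."` are mine, not from the site's documented API; a real JSON parser is safer):

import java.lang.StringBuilder;

public class HtmlFieldExtractor {
    // Naive extraction of the "html" string field from a JSON object.
    // Handles backslash escapes (\" \\ \/ \n \t) but nothing fancier.
    static String extractHtml(String json) {
        String key = "\"html\":\"";
        int start = json.indexOf(key);
        if (start < 0) return "";
        StringBuilder sb = new StringBuilder();
        for (int i = start + key.length(); i < json.length(); i++) {
            char c = json.charAt(i);
            if (c == '\\' && i + 1 < json.length()) {
                char next = json.charAt(++i);
                switch (next) {
                    case 'n': sb.append('\n'); break;
                    case 't': sb.append('\t'); break;
                    default:  sb.append(next); // \" \\ \/ and friends
                }
            } else if (c == '"') {
                break; // unescaped quote ends the field
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String json = "{\"worked\":true,\"has_more\":true,"
                    + "\"next_url\":\"\\/companies?page=3\","
                    + "\"html\":\"<div class=\\\"name\\\">Acme<\\/div>\"}";
        System.out.println(extractHtml(json)); // <div class="name">Acme</div>
    }
}

The extracted string can then be fed straight into Jsoup.parse(). Again, this is only a sketch; it will break on any JSON shape it doesn't anticipate, which is why a proper parser is the better choice below.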

I think you would be better off using one of the ways described in this post, connect to http://www.top1000.ie/companies?page=1 and read the data page by page.

Edit: here's a minimal example of how you could approach this problem using HttpURLConnection and the minimal-json parser.

void readPage(int page) throws IOException {
    URL url = new URL("http://www.top1000.ie/companies?page=" + page);

    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setDoOutput(true);
    connection.setRequestMethod("POST");

    try (OutputStreamWriter writer = new OutputStreamWriter(connection.getOutputStream())) {
        // no need to post any data for this page
        writer.write("");
    }

    if (connection.getResponseCode() == HttpURLConnection.HTTP_OK) {
        try (Reader reader = new InputStreamReader(connection.getInputStream())) {
            String html = Json
                .parse(reader)
                .asObject()
                .getString("html", "");

            Elements companies = Jsoup
                .parse(html)
                .body().select(".content .name");

            for (Element company : companies) 
                System.out.println("Company: " + company.text());
        }
    } else {
        // handle HTTP error code.
    }
}

Here we use HttpURLConnection to send a POST request (without any data) to the URL, use the JSON parser to get the "html" field from the result and then parse it using Jsoup. Just call the method in a loop for the pages you want to read.
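The loop can use the "has_more" field from the response to decide when to stop instead of hard-coding a page count. The sketch below factors the fetch out into an injected function so the stopping logic can be shown (and exercised) without network I/O; the `walk` helper and the fake fetcher are illustrative names of mine, and in practice the fetcher would be the POST request shown above:

import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class PageWalker {
    // Fetch pages 1..maxPages, stopping early once a response no longer
    // reports "has_more":true. The naive contains() check stands in for
    // proper JSON parsing to keep the sketch dependency-free.
    static List<String> walk(IntFunction<String> fetch, int maxPages) {
        List<String> bodies = new ArrayList<>();
        for (int page = 1; page <= maxPages; page++) {
            String json = fetch.apply(page);
            bodies.add(json);
            if (!json.contains("\"has_more\":true")) break;
        }
        return bodies;
    }

    public static void main(String[] args) {
        // Fake fetcher simulating a site that runs out of data on page 3.
        IntFunction<String> fake = page ->
            page < 3 ? "{\"has_more\":true,\"html\":\"page " + page + "\"}"
                     : "{\"has_more\":false,\"html\":\"page " + page + "\"}";
        System.out.println(walk(fake, 10).size()); // prints 3
    }
}

Wiring in the real request is then a matter of passing a lambda that performs the POST from readPage and returns the response body.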

nyname00
  • Nice analysis and a bit faster than my revised answer. +1 :) – luksch Apr 19 '16 at 15:15
  • Thank you for your analysis, but I have some doubts: 1. How did you get that JSON? 2. Which of the ways described in the other post do you think is best? Thank you again! – Francisco Romero Apr 19 '16 at 15:23
  • @Error404 I was using Chrome Developer Tools to inspect network traffic. If you have some experience with Selenium or WebDriver (as @luksch suggested), you can give them a try, but a simple HTTP request plus a JSON parser would be my first choice – nyname00 Apr 19 '16 at 15:33
  • @nyname00 I am totally new to developing web applications, so maybe I am misunderstanding you. Do you mean using `HttpURLConnection`? http://docs.oracle.com/javase/7/docs/api/java/net/HttpURLConnection.html. I am sorry but I do not know how to turn a word into a link in comments. – Francisco Romero Apr 19 '16 at 15:41
  • @Error404 yep, or Apache HttpClient, both will do. See [here](http://www.mkyong.com/java/how-to-send-http-request-getpost-in-java/) for a tutorial – nyname00 Apr 19 '16 at 16:44
  • @nyname00 With `HttpURLConnection` I again got the first 20 companies, but I cannot figure out how to get the rest of the companies with a JSON parser. – Francisco Romero Apr 20 '16 at 10:11
  • @nyname00 Thank you for the example, but I am having trouble adding this JSON parser to my `Eclipse` project. I copied all the files into my project but it does not work. – Francisco Romero Apr 20 '16 at 11:50
  • @Error404 could you get a bit more specific? – nyname00 Apr 20 '16 at 11:59
  • @nyname00 Sorry, I feel so dumb. I was using the wrong `.jar` file. Now I am going to try your code ^^ – Francisco Romero Apr 20 '16 at 12:05
  • @nyname00 It works like a charm. Thank you very much! – Francisco Romero Apr 20 '16 at 12:19