I'm trying to use Jsoup to get the data tables from this website: http://aws.amazon.com/ec2/pricing/

I need to get the data from the tables, and I'm starting with the first one, but the page only loads the table after some time.

Document doc = Jsoup.connect(html).get();
Elements tableElements = doc.select("table");
Elements tableHeaderEles = tableElements.select("thead tr th");
Elements tableRowElements = tableElements.select(":not(thead) tr");
Instance ins = new Instance();
for (int i = 0; i < tableRowElements.size(); i++) {
    Element row = tableRowElements.get(i);
    System.out.println("row");
    Elements rowItems = row.select("td");
    for (int j = 0; j < rowItems.size(); j++) {
        System.out.println(rowItems.get(j).text());
    }
    System.out.println();
}
nyedidikeke
  • Add a userAgent and a timeout to your connection. Make sure you are getting the source code correctly. And then try out your CSS query on this site - http://try.jsoup.org/. When I tried out `thead` on the URL, I did not get anything. – LittlePanda Apr 17 '15 at 05:40

2 Answers

1

The reason you cannot get the desired content is that some of it is loaded via Ajax, which Jsoup cannot see: Jsoup only parses the HTML returned by the server and does not execute JavaScript.

Please refer to "Fetch contents (loaded through AJAX call) of a web page", which shows that tools such as HtmlUnit can do this for you.

chenzhongpu
0

Jsoup:

  • Add a userAgent and a timeout to your connection.
  • Make sure you get the source code correctly.
  • Try out your CSS Selector query on http://try.jsoup.org/.
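
One way to check the second point, sketched under some assumptions: fetch the raw response and look for a <table> tag before running any selector. This sketch deliberately uses the JDK's built-in java.net.http.HttpClient (Java 11+) instead of Jsoup so that it runs without extra dependencies. The class name RawHtmlCheck, the helper names, the "Mozilla/5.0" user agent string and the 10-second timeouts are illustrative choices, not anything prescribed by the tools above. If no <table> tag appears in the raw markup, the table is probably added later by JavaScript, and a selector run on the fetched document will find nothing.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.regex.Pattern;

final class RawHtmlCheck {

    // Matches an opening <table> tag, with or without attributes, in any letter case.
    private static final Pattern TABLE_TAG =
            Pattern.compile("<table[\\s>]", Pattern.CASE_INSENSITIVE);

    // Returns true if the raw markup contains at least one opening <table> tag.
    static boolean containsTableTag(String html) {
        return TABLE_TAG.matcher(html).find();
    }

    // Fetches the page body as text, sending a browser-like User-Agent
    // and giving up after the (assumed) 10-second timeouts.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        String html = fetch("http://aws.amazon.com/ec2/pricing/");
        if (containsTableTag(html)) {
            System.out.println("Raw HTML contains a <table>; a CSS selector has something to match.");
        } else {
            System.out.println("No <table> in the raw HTML; it is probably rendered by JavaScript.");
        }
    }
}
```

With Jsoup itself, the corresponding settings are the userAgent(...) and timeout(...) methods on the connection returned by Jsoup.connect(...).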

PhantomJSDriver:

If the problem is caused by JavaScript (Jsoup does not execute JavaScript), then I suggest Selenium + PhantomJSDriver (GhostDriver), which is used for GUI-less browser automation. With this you can easily navigate through pages, select elements, submit forms and also do some scraping, and JavaScript is supported.

You can go through the Selenium documentation here. You will have to download the phantomjs.exe file.

A good tutorial for PhantomJSDriver is given here.

Configuration of PhantomJSDriver (from the tutorial):

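import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriverService;
import org.openqa.selenium.remote.DesiredCapabilities;

DesiredCapabilities caps = new DesiredCapabilities();
caps.setJavascriptEnabled(true); // not really needed: JS enabled by default
caps.setCapability(PhantomJSDriverService.PHANTOMJS_EXECUTABLE_PATH_PROPERTY, "C://phantomjs.exe");
caps.setCapability("takesScreenshot", true);
WebDriver driver = new PhantomJSDriver(caps);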
LittlePanda