
I want to scrape a website but when I connect to it using Jsoup.connect(url) only a part of the page is loaded.

When I downloaded the page as HTML, I saw that one part of it contains only a loader icon, so I concluded that this part is loaded afterwards from some other source.

The funny thing is that "Inspect element" shows the missing HTML while "View page source" doesn't. The HTML loaded by Jsoup is basically the same as what "View page source" shows.

Is there a way to bypass this and load the whole page as it is displayed in the browser?

The page in question is this: https://www.oddsportal.com/tennis/australia/atp-australian-open-2017/results/page/1/

Ask for any additional information I could provide.

===============

EDIT: I am connecting to the URL like this:

Document doc = null;

try {
    doc =  Jsoup.connect(url).get();
} catch (IOException e) {
    e.printStackTrace();
}

I am getting this div using a CSS selector:

Elements tournamentTable = doc.select("div[id=tournamentTable]");

Content of tournamentTable is <div id="tournamentTable"></div>

wdc
  • What part of the page is loading? What does your code look like? Please edit your question and add these details. – Cardinal System Jan 08 '19 at 23:10
  • @CardinalSystem The div with `id=tournamentTable` is empty when loaded with Jsoup. – wdc Jan 08 '19 at 23:10
  • @CardinalSystem Edited, but I don't think it's code related as the source code of this page also doesn't contain anything in this div. I can only see this div if I inspect some element (in chrome) inside the div. – wdc Jan 08 '19 at 23:31
  • The data is being injected by Javascript. You'll need to wait for the page to fully load, then pull its contents. Or query its backing API directly. – Roddy of the Frozen Peas Jan 08 '19 at 23:34
  • Turn off JavaScript support for that page in your browser and see how it looks. That is what Jsoup has to work with. It is not a browser emulator with JavaScript support, so you would need to use other tools. See duplicate (link at top of your question) for suggestions. – Pshemo Jan 08 '19 at 23:42
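
The "query its backing API directly" suggestion from the comments could look roughly like the sketch below. It uses only the JDK's `java.net.http` client (Java 11+). The class name `BackingApiSketch`, the header values, and any endpoint you pass in are assumptions for illustration: the real URL has to be found by watching the browser's Network tab while the page loads, and the site may require additional headers or parameters.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class BackingApiSketch {

    // Builds a GET request for the endpoint that the page's JavaScript calls.
    // The endpoint is hypothetical here: look up the real one in the Network tab.
    static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "*/*")
                .header("User-Agent", "Mozilla/5.0") // assumed value; some sites reject unknown clients
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body as text.
    static String fetch(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java BackingApiSketch <endpoint-url>");
            return;
        }
        System.out.println(fetch(buildRequest(args[0])));
    }
}
```

If the response turns out to be HTML, it can be handed to `Jsoup.parse(...)` and queried with the same selectors as before; if it is JSON, a JSON parser is the better fit.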

1 Answer


It seems the element with id=tournamentTable is generated dynamically using JavaScript. Jsoup does not evaluate JavaScript, so you'd have to use a library like HtmlUnit. For example:

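// HtmlUnit 2.x package names; HtmlUnit 3.x uses org.htmlunit instead
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setJavaScriptEnabled(true); // enable JavaScript
webClient.getOptions().setThrowExceptionOnScriptError(false); // continue even if a script on the page throws
HtmlPage page = webClient.getPage(url); // load the page first so its scripts are started
webClient.waitForBackgroundJavaScript(5000); // important! wait up to 5 s for background JavaScript to finish

DomElement tournamentTable = page.getElementById("tournamentTable");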
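
A fixed wait like the 5000 ms above is either too long or too short depending on the network. One possible alternative is to poll for the condition you actually care about and give up after a timeout. The helper below is a minimal, library-independent sketch; `WaitSketch` and `waitUntil` are hypothetical names, not part of HtmlUnit, and the condition in `main` is only a stand-in for "the table has been filled".

```java
import java.util.function.BooleanSupplier;

class WaitSketch {

    // Checks `condition` every `pollMillis` until it holds or `timeoutMillis` has elapsed.
    // Returns true if the condition was met in time, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in condition: becomes true roughly 200 ms after start.
        long start = System.currentTimeMillis();
        boolean ready = waitUntil(() -> System.currentTimeMillis() - start >= 200, 2000, 50);
        System.out.println("ready = " + ready);
    }
}
```

With HtmlUnit, the condition would be something like "the element returned by `page.getElementById("tournamentTable")` is no longer empty", checked after `getPage(url)` has returned.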
Krzysztof Atłasik