I am trying to parse a web page and extract data using Jsoup. However, the URL is dynamic and shows a "loading, please wait" page before displaying the details, so Jsoup seems to process the waiting page rather than the details page. Is there any way to make it wait until the page is fully loaded?
Can you please add the URLs and a real example? – Davide Pastore Mar 20 '16 at 11:04
You can try ui4j instead of Jsoup here: https://github.com/ui4j/ui4j. – Stephan Mar 21 '16 at 10:06
3 Answers
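If some of the content is created dynamically after the page has loaded, your best chance of parsing the full content is to use Selenium together with Jsoup:
// Let a real browser load the page and run its JavaScript, then hand the rendered HTML to Jsoup.
WebDriver driver = new FirefoxDriver();
driver.get("http://stackoverflow.com/");
Document doc = Jsoup.parse(driver.getPageSource());
driver.quit(); // close the browser once the HTML has been captured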

Here too, if the web page I am trying to parse uses JavaScript, the browser's page source is read before the scripts have finished, and I get the waiting or loading page. So waiting for the elements to load, instead of parsing immediately with Jsoup, is probably a better way; that seems to have worked for me. Maybe your answer included Jsoup because my initial question mentioned it; perhaps I should have worded my question better. Thanks! – Thiru Apr 06 '16 at 05:13
The page in question is probably generated by JavaScript in the browser (client-side). Jsoup does not interpret JavaScript, so it cannot see that content on its own. However, you could watch the page load in the network tab of the browser's developer tools and find out which AJAX calls are made during loading. These calls have URLs of their own, and you may be able to get all the information you need by requesting them directly. Alternatively, you can use a real browser engine to load the page, for example through a library such as Selenium WebDriver, or through the JavaFX WebKit component if you are using Java 8.
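As a rough illustration of the "request the AJAX URL directly" idea, here is a minimal sketch using only the JDK's `java.net.http` client (Java 11 or later). The class name `AjaxFetch`, the endpoint URL, and the two request headers are assumptions for illustration only; the real URL and headers have to be copied from whatever the network tab shows for the page in question.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class AjaxFetch {
    // Builds a GET request that resembles a browser's background (XHR) call.
    // The headers are typical examples, not something every site requires.
    static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json")
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body (often JSON or an HTML fragment).
    static String fetch(String endpoint) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(endpoint), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Pass the endpoint found in the network tab as the first argument,
        // e.g. a hypothetical https://example.com/api/details?id=42
        if (args.length > 0) {
            System.out.println(fetch(args[0]));
        }
    }
}
```

If the endpoint returns an HTML fragment, the body can still be passed to `Jsoup.parse(...)`; if it returns JSON, a JSON parser is the better fit.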

I am really just expanding on luksch's answer. I am not familiar with web frameworks, so that answer was a little difficult for me to understand. Because the page loads dynamically, a parser like Jsoup is hard to use: we must know that all the elements have loaded completely before attempting to parse. So instead of parsing immediately, use the WebDriver (Selenium) to check the status of the elements. Once they are loaded, either get the page source and parse it, or use the WebDriver itself to gather the data instead of a separate parser.
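import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

WebDriver driver = new ChromeDriver();
driver.get("<DynamicURL>");
List<WebElement> elements = null;
// Poll until the elements with class "marker" carry the expected values.
// valuePresent(...) and processElements(...) are helpers you write yourself.
// Note: this loops forever if the elements never appear; add a timeout in real code.
while (elements == null)
{
    elements = driver.findElements(By.className("marker"));
    if (!valuePresent(elements))
    {
        // Not ready yet: reset and check again.
        elements = null;
    }
}
// The loop only exits once elements is non-null and valuePresent(...) accepted it.
processElements(elements);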
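The loop above keeps checking until the elements are ready. A minimal sketch of the same idea with a timeout and a pause between checks, written with the JDK only so it runs without a browser, might look like the following. `Poller` and `waitFor` are hypothetical names introduced here, and the `main` method uses a stand-in for the page rather than a real WebDriver.

```java
import java.time.Duration;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.Supplier;

class Poller {
    // Calls `probe` repeatedly until `ready` accepts its result or `timeout` elapses.
    // Returns the accepted value, or Optional.empty() on timeout or interruption.
    static <T> Optional<T> waitFor(Supplier<T> probe, Predicate<T> ready,
                                   Duration timeout, Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            T value = probe.get();
            if (value != null && ready.test(value)) {
                return Optional.of(value);
            }
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty(); // gave up: the condition never became true
            }
            try {
                Thread.sleep(interval.toMillis()); // avoid a busy loop between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a page whose elements only show up on the third check.
        int[] calls = {0};
        Supplier<List<String>> probe =
                () -> ++calls[0] < 3 ? List.of() : List.of("marker-1", "marker-2");

        Optional<List<String>> found = waitFor(
                probe, list -> !list.isEmpty(),
                Duration.ofSeconds(2), Duration.ofMillis(50));

        System.out.println(found.map(List::size).orElse(0)
                + " elements after " + calls[0] + " checks");
    }
}
```

With Selenium, the probe would be something like `() -> driver.findElements(By.className("marker"))` and the readiness check would be the `valuePresent` helper from this answer. Selenium also ships its own `WebDriverWait` class for this purpose, which is probably the better choice in real code.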
