
I'm trying to use the jsoup library to get the 'li' elements from a website. The problem is this:

  • If I open the source of the website with CTRL+U (which is the same HTML that jsoup reads), the 'ul' tag is hidden.

hidden result

  • If I open the page with Google Chrome's "Inspect" function, the 'li' elements are shown.

shown result

Posting my full code is not necessary; I only want to know how I can access these 'li' elements with jsoup or another free Java library, given that in the source code (and through jsoup) this information is hidden.

The site is https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco; try searching for something (e.g. Tachi).
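For context, this is roughly the jsoup call I am making (just a sketch; the list id ul_farm_results is what the inspector shows, and the URL is the search page for "Tachi"):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    // Fetch the raw HTML exactly as the server returns it (throws IOException; no JavaScript is run)
    Document doc = Jsoup.connect("https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco?search=Tachi").get();
    // The result list comes back empty here, even though the inspector shows the 'li' items
    Elements items = doc.select("#ul_farm_results li");
    System.out.println(items.size());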

Fidelis
  • Can you post at least the link to the website? – Shakhar Dec 31 '16 at 15:53
  • It is hard to help you without the ability to reproduce the problem. There could be different potential causes of this situation, and each one of them would have to be solved differently. An answer to your question could require writing quite a long article. Please [edit] your question and include the minimal amount of information that will actually let us reproduce this problem. – Pshemo Dec 31 '16 at 15:57
  • @Shakhar posted :-) – Fidelis Jan 01 '17 at 17:21
  • @Pshemo The problem is reproduced every time you search for something on the site above. – Fidelis Jan 01 '17 at 17:22

2 Answers


The problem with Jsoup is that it doesn't execute scripts: it just gets the HTML as it is, before the AJAX code runs.

You can use something like HtmlUnit, which is basically a GUI-less (headless) browser, so it can handle scripts.

You can try something like this after getting the HtmlUnit library:

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.html.HtmlUnorderedList;

    String url = "https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco?search=Tachi";
    try (final WebClient webClient = new WebClient()) {
        // HtmlUnit executes the page's JavaScript, so the AJAX-filled list is present
        final HtmlPage page = webClient.getPage(url); // throws IOException if the request fails
        final HtmlUnorderedList list = page.getHtmlElementById("ul_farm_results");
        System.out.println(list.asText());
    }

I couldn't check the code, as the website's certificate is improperly configured and I didn't want to import its certificate. You may want to take a look at this to resolve the certificate errors.
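If you just want to get past the certificate error while testing, HtmlUnit can be told to accept insecure SSL instead of importing the certificate (this is the workaround the comment below ends up using). A sketch of the change, placed right after creating the WebClient in the snippet above:

    // For testing only: accept the site's misconfigured certificate
    // rather than importing it into a trust store.
    webClient.getOptions().setUseInsecureSSL(true);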

Shakhar
  • Your solution works, thank you! I added WebClientOptions wco = webClient.getOptions(); and wco.setUseInsecureSSL(true); to avoid the certificate error. How can I speed up the code? @Shakhar – Fidelis Jan 02 '17 at 22:16

JSoup does not execute any scripts; it just gets the HTML returned by the server. What you are looking for is called rendered HTML, that is, the HTML produced by the browser after it has executed all the scripts.

The best solution in Java is to use Selenium with your preferred browser. Selenium was developed for UI testing; it is, however, also very popular as a scraping tool.

A good getting-started page can be found here.

Some example code with Firefox:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

WebDriver driver = new FirefoxDriver();
driver.get("https://farmaci.agenziafarmaco.gov.it/bancadatifarmaci/cerca-farmaco");
// Find the result list once the page (and its scripts) have loaded
String id = "ul_farm_results";
WebElement element = driver.findElement(By.id(id));
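From there you can, for example, read the text of each entry and close the browser when you are done (this continues the snippet above; as in the other answer, you may need to load the search URL with ?search=Tachi so that the list is actually populated):

// Print the text of every <li> inside the result list, then shut Firefox down
for (WebElement li : element.findElements(By.tagName("li"))) {
    System.out.println(li.getText());
}
driver.quit();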
Julien