1

I'm new to jsoup and I'm struggling to retrieve the tables with the class name verbtense and the headers Present and Past, under the div named Indicative, from this site: https://www.verbix.com/webverbix/Swedish/misslyckas

I started by trying the following, but I get no results from the get-go:

Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();
Elements tables = document.select("table[class=verbtense]"); // empty

I also tried this, but again no results:

        Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get();

        Elements divs = document.select("div");


        if (!divs.isEmpty()) {
            for (Element div : divs) {
                // all of these are empty
                Elements verbTenses = div.getElementsByClass("verbtense");
                Elements verbTables = div.getElementsByClass("verbtable");
                Elements tables = div.getElementsByClass("table verbtable");
            }
        }

What am I doing incorrectly?

mr nooby noob
  • Hello, it's "table.verbtable" (https://jsoup.org/cookbook/extracting-data/selector-syntax): Element tables = doc.select("table.verbtense").first(); – stacky Mar 07 '21 at 16:54
  • @stacky I tried Document document = Jsoup.connect("https://www.verbix.com/webverbix/Swedish/misslyckas").get(); Element tables = document.select("table.verbtense").first(); but that still results in nothing :/ – mr nooby noob Mar 07 '21 at 17:32

2 Answers

2

The page you are trying to scrape has content that is generated dynamically on the client side (with JavaScript), therefore you won't be able to extract the data using that link.

You might be able to scrape some content from the API call that this webpage makes, e.g. https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas

Inspect the network traffic in your browser's developer tools to see which requests the page makes, and make the same requests yourself.
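To illustrate why the selector comes back empty, here is a minimal sketch. It assumes the server sends only an empty container plus a script that fills it in later; the markup, the id "conjugations" and the class name StaticHtmlCheck are invented for illustration and are not taken from the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticHtmlCheck {

    // Hypothetical stand-in for the server response: an empty div and a script
    // that would insert the table in a browser. Jsoup never runs the script.
    static final String INITIAL_HTML =
        "<html><body><div id=\"conjugations\"></div>"
        + "<script>document.getElementById('conjugations').innerHTML ="
        + " '<table class=\"verbtense\"></table>';</script></body></html>";

    // Counts the tables with class "verbtense" that exist in the parsed markup.
    static int countVerbTenseTables(String html) {
        Document document = Jsoup.parse(html);
        return document.select("table.verbtense").size();
    }

    public static void main(String[] args) {
        // The table only appears inside the script text, so no element matches.
        System.out.println(countVerbTenseTables(INITIAL_HTML));
    }
}
```

The table markup inside the script is treated as script text rather than as elements, which is the same situation as with the real page: the data is simply not in the HTML that Jsoup receives.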


Antoniossss
  • I did 2 things. I checked the web page source, and I checked the network tab. – Antoniossss Mar 07 '21 at 17:36
  • Try this (in Firefox) to see the webpage source as it is fetched from the server (before JS execution): view-source:https://www.verbix.com/webverbix/Swedish/misslyckas. You will see that the data you are looking for is not in the DOM, therefore it must be coming from somewhere else, not the original DOM you are fetching using Jsoup.connect. – Antoniossss Mar 07 '21 at 17:36
  • Just google "how to scrap javascript site" and you will find dozens of helpful materials on the topic – Antoniossss Mar 07 '21 at 17:38
  • FYI the correct term is ‘scrape’ - ‘to scrap’ means to throw away (like rubbish) – DisappointedByUnaccountableMod Mar 07 '21 at 20:44
2

The first catch is that this page loads its content asynchronously using AJAX and uses JavaScript to add the content to the DOM. You can even see the loader for a short time.

Jsoup can't execute JavaScript, so all you get is the initial page :( The next step is to check what the browser is doing and where this additional content comes from. You can check this using Chrome's developer tools (Ctrl + Shift + I). If you open the Network tab, filter for XHR requests and refresh the page, you can see two requests.

One of them fetches this content: https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas. As you can see, it's JSON containing HTML fragments, and this content seems to include the verb forms you need. But here's another catch: unfortunately Jsoup can't parse JSON :( So you'll have to use another library to extract the HTML fragment, and then you can parse that fragment with Jsoup. To download the JSON with Jsoup, tell it to ignore the content type (otherwise Jsoup will complain that it doesn't support JSON):

String json = Jsoup.connect("https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas").ignoreContentType(true).execute().body();

Then you'll have to use a JSON parsing library, for example json-simple, to obtain the HTML fragment, which you can then parse with Jsoup:

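import org.json.simple.JSONObject;
import org.json.simple.JSONValue;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// Download the raw JSON; ignoreContentType(true) stops Jsoup from rejecting the non-HTML response
String json = Jsoup.connect(
    "https://api.verbix.com/conjugator/iv1/ab8e7bb5-9ac6-11e7-ab6a-00089be4dcbc/1/21/121/misslyckas")
    .ignoreContentType(true).execute().body();
System.out.println(json);
// Parse the JSON with json-simple and pull out the HTML fragment stored under p1 -> html
JSONObject jsonObject = (JSONObject) JSONValue.parse(json);
String htmlFragmentObtainedFromJson = (String) ((JSONObject) jsonObject.get("p1")).get("html");
// Hand the fragment to Jsoup so that selectors can be used on it
Document document = Jsoup.parse(htmlFragmentObtainedFromJson);
System.out.println(document);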

Now you can return to your initial approach and use selectors on the document object to get what you want.
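As a rough sketch of that last step, the following picks a verbtense table by its header text. It assumes each table carries the tense name in a th cell; the sample fragment, the class name VerbTenseExtractor and the method rowsForTense are made up for illustration, and the real fragment returned by the API may be structured differently.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class VerbTenseExtractor {

    // Hypothetical sample markup, not copied from the API response.
    static final String SAMPLE_FRAGMENT =
        "<div><h3>Indicative</h3>"
        + "<table class=\"verbtense\"><tr><th>Present</th></tr>"
        + "<tr><td>jag</td><td>misslyckas</td></tr></table>"
        + "<table class=\"verbtense\"><tr><th>Past</th></tr>"
        + "<tr><td>jag</td><td>misslyckades</td></tr></table></div>";

    // Returns the text of each data row of the verbtense table whose
    // first header cell equals tenseName; empty if no such table exists.
    static List<String> rowsForTense(String htmlFragment, String tenseName) {
        Document document = Jsoup.parse(htmlFragment);
        List<String> rows = new ArrayList<>();
        for (Element table : document.select("table.verbtense")) {
            Element header = table.selectFirst("th");
            if (header == null || !header.text().equals(tenseName)) {
                continue; // not the tense we are looking for
            }
            // Only rows that contain td cells, which skips the header row
            for (Element row : table.select("tr:has(td)")) {
                rows.add(row.text());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(rowsForTense(SAMPLE_FRAGMENT, "Present"));
        System.out.println(rowsForTense(SAMPLE_FRAGMENT, "Past"));
    }
}
```

With the real data you would pass htmlFragmentObtainedFromJson instead of the sample string and adjust the selectors to whatever structure the fragment actually has.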

Krystian G