I would like to read the content of a website into a string.

I started by using jsoup as follows:

private void getWebsite() {
    new Thread(new Runnable() {
        @Override
        public void run() {
            final StringBuilder builder = new StringBuilder();

            try {

                String query = "https://merhav.nli.org.il/primo-explore/search?tab=default_tab&search_scope=Local&vid=NLI&lang=iw_IL&query=any,contains,הארי פוטר";

                Document doc = Jsoup.connect(query).get();
                String title = doc.title();
                Elements links = doc.select("a[href]"); // anchors with an href, matching the loop below

                builder.append(title).append("\n");

                for (Element link : links) {
                    builder.append("\n").append("Link : ").append(link.attr("href"))
                            .append("\n").append("Text : ").append(link.text());
                }
            } catch (IOException e) {
                builder.append("Error : ").append(e.getMessage()).append("\n");
            }

            runOnUiThread(new Runnable() {
                @Override
                public void run() {
                    tv_result.setText(builder.toString());

                }
            });
        }
    }).start();
}

However, the problem is that when I load this site in a web browser such as Chrome, the page source contains the line:

window.appPerformance.timeStamps['index.html']= Date.now();</script><primo-explore><noscript>JavaScript must be enabled to use the system</noscript><style>.init-message {

I have read that jsoup doesn't have a good solution for this case. Is there a good way to get the elements of this page even though it uses JavaScript?

EDIT:

After trying the suggestions below, I used a WebView to load the URL and then parsed the result using jsoup as follows:

wb_result.getSettings().setJavaScriptEnabled(true);
MyJavaScriptInterface jInterface = new MyJavaScriptInterface();
wb_result.addJavascriptInterface(jInterface, "HtmlViewer");

wb_result.setWebViewClient(new WebViewClient() {
    @Override
    public void onPageFinished(WebView view, String url) {
        wb_result.loadUrl("javascript:window.HtmlViewer.showHTML ('<head>'+document.getElementsByTagName('html')[0].innerHTML+'</head>');");
    }
 });
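For completeness, this relies on a MyJavaScriptInterface class along these lines (a minimal sketch: the method name showHTML must match the JavaScript call above, and handing the received HTML to jsoup is an assumption about the next step):

```java
import android.webkit.JavascriptInterface;

// Receives the rendered HTML pushed from the page via
// window.HtmlViewer.showHTML(...) in onPageFinished above.
class MyJavaScriptInterface {
    @JavascriptInterface
    public void showHTML(String html) {
        // Called on a background thread, not the UI thread.
        // Parse the rendered DOM here, e.g. with jsoup:
        // Document doc = Jsoup.parse(html);
    }
}
```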

It did the job and indeed showed me the elements. However, unlike a browser, it still shows some lines as a template expression rather than its evaluated result. For example:

ng-href="{{::$ctrl.getDeepLinkPath()}}"

Is there a way to parse and display the result as a browser would?

Thank you

Ben
  • Doesn't look like there's a straightforward way, but there's [this post](https://stackoverflow.com/questions/17399055/android-web-scraping-with-a-headless-browser). – Phix Nov 03 '20 at 19:53
  • I can help you with this website. I like to translate foreign-news websites... This one (after clicking translate) appears to be one with **e-books.** I would first ask what you are trying to do. This page uses **JavaScript**, so a `static-page` **HTML scrape** is not likely to give you what you see in a web browser. You will need to do more. What do you want from this site? – Y2020-09 Nov 03 '20 at 21:00

1 Answer

I'd suggest looking at the Network tab in Chrome developer tools and then loading the URL; you'll see a lot of requests going back and forth.

Two that seem to contain relevant content are:

https://merhav.nli.org.il/primo_library/libweb/webservices/rest/primo-explore/v1/pnxs?blendFacetsSeparately=false&getMore=0&inst=NNL&lang=iw_IL&limit=10&newspapersActive=false&newspapersSearch=false&offset=0&pcAvailability=true&q=any,contains,%D7%94%D7%90%D7%A8%D7%99+%D7%A4%D7%95%D7%98%D7%A8&qExclude=&qInclude=&refEntryActive=false&rtaLinks=true&scope=Local&skipDelivery=Y&sort=rank&tab=default_tab&vid=NLI

which requires a token to access, which comes from:

https://merhav.nli.org.il/primo_library/libweb/webservices/rest/v1/guestJwt/NNL?isGuest=true&lang=iw_IL&targetUrl=https%253A%252F%252Fmerhav.nli.org.il%252Fprimo-explore%252Fsearch%253Ftab%253Ddefault_tab%2526search_scope%253DLocal%2526vid%253DNLI%2526lang%253Diw_IL%2526query%253Dany%252Ccontains%252C%2525D7%252594%2525D7%252590%2525D7%2525A8%2525D7%252599%252520%2525D7%2525A4%2525D7%252595%2525D7%252598%2525D7%2525A8&viewId=NLI
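As an aside, note that the targetUrl parameter in that guestJwt URL is percent-encoded twice (`%253A` decodes to `%3A`, which decodes to `:`). When inspecting these requests it can help to undo that; a small sketch using only the JDK:

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class DecodeTarget {
    // Undo double percent-encoding, as seen in the targetUrl query parameter.
    static String decodeTwice(String s) {
        String once = URLDecoder.decode(s, StandardCharsets.UTF_8);
        return URLDecoder.decode(once, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String target = "https%253A%252F%252Fmerhav.nli.org.il%252Fprimo-explore%252Fsearch";
        System.out.println(decodeTwice(target));
        // prints "https://merhav.nli.org.il/primo-explore/search"
    }
}
```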

... which likely requires the JSESSIONID cookie, which comes from:

https://merhav.nli.org.il/primo_library/libweb/webservices/rest/v1/configuration/NLI

... so in order to replicate the chain of calls, you could use jsoup to make these (and any other relevant) HTTP GET requests and pull out the relevant HTTP headers (typically: session, Referer, Accept, and potentially some other cookie values).
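A rough sketch of what recreating that chain might look like with plain java.net (the endpoint order is taken from the answer; the exact header names the server expects, and the Bearer token scheme for the JWT, are assumptions you would verify against the Network tab):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class PrimoChain {
    // Pull a single "NAME=VALUE" pair out of the Set-Cookie response headers.
    static String extractCookie(List<String> setCookieHeaders, String name) {
        if (setCookieHeaders == null) return null;
        for (String h : setCookieHeaders) {
            if (h.startsWith(name + "=")) {
                return h.split(";", 2)[0]; // drop attributes like Path/HttpOnly
            }
        }
        return null;
    }

    // One GET in the chain, carrying forward the session cookie and token.
    static String get(String url, String cookie, String token) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        if (cookie != null) conn.setRequestProperty("Cookie", cookie);
        if (token != null) conn.setRequestProperty("Authorization", "Bearer " + token);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // 1) configuration call: establishes the session (JSESSIONID cookie)
        // 2) guestJwt call: returns the guest token, sent with that cookie
        // 3) pnxs call: the actual search, sent with cookie + token
        // Use the three endpoints listed above; each response feeds the next.
    }
}
```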

It's not going to be straightforward, but you're essentially looking for a URL on the page in one of the JSON responses from one of the network requests:

[Image: Chrome developer tools Network tab, showing the JSON response that contains the URL]

Once you know which request you want to recreate, you just have to work back up the list of requests and try to recreate them.

This one is not easy and would require a lot of time to recreate. My advice, if you're going to attempt it: forget trying to parse the HTML; try to rebuild/recreate the chain of three or so HTTP requests to the back end to get the relevant JSON, and parse that. You can often pick apart a website this way, but this one's a big job.

Rob Evans
  • It's like a list of Harry Potter Audio Books (in Hebrew)... That's pretty good work, though (BTW). – Y2020-09 Nov 04 '20 at 00:21
  • @Rob Evans, thanks for the detailed answer. Will give it a try. – Ben Nov 04 '20 at 04:27
  • Could you please explain how, for example, from the image you posted, you worked out that you needed the 2nd link for the token? Where did you see that it requires a token? – Ben Nov 07 '20 at 16:44
  • Where you see the Preview tab you can also see the Headers tab. In the headers you can see the request headers sent and the response headers received. In those headers you will see various values going to the server and another set being returned. These help track sessions and set various other HTTP/request info. Most often a session cookie is set on opening a website (to track the user's browsing session), along with various other details like Referer and tokens used to prevent automated access. These are easily replicated if you know what to look for. I've been building services like this for a while now. – Rob Evans Nov 07 '20 at 19:51
  • If you follow the requests to the server and the responses returned, you'll see `Cookie:` headers set. Initially these are sent to the browser using `Set-Cookie` in a response header. These are then part of the browsing session and used to verify whether a browser or a bot is attempting to access the website. If you want to appear as a human user, you typically need to replicate the headers in the requests: any Referer set, cookies set, session IDs (if set), and the HTTP method type. It's all part of the HTTP protocol and essentially how the WWW works. – Rob Evans Nov 07 '20 at 19:54