I am trying to extract the text shown on a webpage. I want the equivalent of manually copying the text and pasting it into Notepad (no formatting).

The problem appears to be that the text shown on the webpage is generated by a number of scripts, and despite my best efforts I am unable to access the text these scripts produce.

The following code downloads the webpage and logs the scripts it contains:

private fun getWebpageUsingJsoup(url: String): String {
    val document = Jsoup.connect(url).get()
    document.select("script").forEach {
        Log.d(TAG, "Found script $it")
    }
    return document.body().text()
}

The above code identifies the following scripts but does not attempt to execute them:

<body>
<script src="runtime.js" type="module"></script>
<script src="polyfills.js" type="module"></script>
<script src="vendor.js" type="module"></script>
<script src="main.js" type="module"></script>
</body>

Following a previous post, "How to use ScriptEngineManager in Android?", I added code to execute the scripts:

val engine = ScriptEngineManager().getEngineByName("rhino")
document.select("script[type=module]").forEach { // note: type=module required to avoid the GoogleAnalytics script
    val script = Jsoup.parseBodyFragment(it.data())
    Log.d(TAG, "Attempting to execute script $it")
    val returnValue = engine.eval(script.html())
    Log.d(TAG, "Engine returns\n$returnValue")
}

Disappointingly, the output of the four scripts is:

<html>
<head/>
<body/>
</html>
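
One possible reading of this output, offered as a guess rather than a confirmed diagnosis: each script tag loads its code through src, so Jsoup's data() has no inline code to return, and the engine is handed only the empty HTML shell produced by parseBodyFragment(). The sketch below illustrates this with a hard-coded copy of the markup. The helper name describeScripts and the base URL https://example.com/app/ are placeholders, not part of the real site or of Jsoup.

```kotlin
import org.jsoup.Jsoup

// Hypothetical helper: pairs each module script's inline code (data())
// with the absolute URL that its src attribute resolves to.
fun describeScripts(html: String, baseUri: String): List<Pair<String, String>> =
    Jsoup.parse(html, baseUri)
        .select("script[type=module]")
        .map { it.data() to it.absUrl("src") }

fun main() {
    val html = """
        <body>
        <script src="runtime.js" type="module"></script>
        <script src="main.js" type="module"></script>
        </body>
    """.trimIndent()

    // The base URL is an assumption made only so that absUrl() can resolve src.
    describeScripts(html, "https://example.com/app/").forEach { (inlineCode, srcUrl) ->
        // inlineCode is empty here because the code lives in the external file
        println("inline code: '$inlineCode', source: $srcUrl")
    }
}
```

Even with the script files downloaded from those URLs, Rhino provides no browser DOM (no document or window objects), so scripts that build the page would probably still not produce the visible text.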

I am fresh out of ideas and appear to be no closer to obtaining the text shown on the webpage.

Can anyone suggest how to obtain the text?

For completeness, these are the dependencies used to support Jsoup and ScriptEngineManager:

implementation 'org.jsoup:jsoup:1.15.4'
implementation 'io.apisense:rhino-android:1.0'
RatherBeSailing

1 Answer

While this is not (yet) a complete answer, I have managed to get the website text.

The process starts with a WebView with JavaScript enabled.

When the page has loaded, the onPageFinished() method calls evaluateJavascript() to extract document.body.textContent:

val webView = findViewById<WebView>(R.id.webView)
webView.settings.javaScriptEnabled = true // the page builds its text with scripts
webView.webViewClient = object : WebViewClient() {
    override fun onPageFinished(view: WebView?, url: String?) {
        super.onPageFinished(view, url)
        // Read the text of the rendered DOM once the page has finished loading
        webView.evaluateJavascript("document.body.textContent") { text ->
            Log.d(TAG, "The website text is $text")
        }
    }
}
webView.loadUrl(downloadURL)

The solution is not perfect because it removes ALL formatting information, including the line feeds that are kept when manually copying and pasting the webpage.
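
One direction that might help, sketched here without having been tried on this page: document.body.innerText is rendering-aware and normally includes line breaks between block elements, whereas textContent simply concatenates the text nodes. In addition, evaluateJavascript() hands its result to the callback as a JSON value, so a string arrives quoted and with escapes such as \n still encoded. The helper names decodeJsResult and extractVisibleText below are hypothetical, and the decoding relies on Android's bundled org.json classes.

```kotlin
import android.webkit.WebView
import org.json.JSONTokener

// Hypothetical helper: turns the JSON-encoded callback value into plain text.
// Returns an empty string when the result is missing or is not a string.
fun decodeJsResult(raw: String?): String {
    if (raw == null) return ""
    val value = JSONTokener(raw).nextValue()
    return value as? String ?: ""
}

// Hypothetical helper: asks the page for its rendered text and passes the
// decoded result to onText. Call it from onPageFinished() or later.
fun extractVisibleText(webView: WebView, onText: (String) -> Unit) {
    webView.evaluateJavascript("document.body.innerText") { raw ->
        onText(decodeJsResult(raw))
    }
}
```

It could be called from onPageFinished() as extractVisibleText(webView) { text -> Log.d(TAG, text) }. If the page fills in its content after the load event, the call may need to be delayed or repeated.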

Can anyone suggest improvements to retain the formatting?

RatherBeSailing