I am trying to extract the text shown on a webpage. I want the equivalent of manually copying the text and pasting it into Notepad (no formatting).
The problem appears to be that the text shown on the webpage is generated by a number of scripts, and despite my best efforts I am unable to access the text these scripts produce.
The following code downloads the webpage and logs the scripts it contains:
private fun getWebpageUsingJsoup(url: String): String {
    val document = Jsoup.connect(url).get()
    document.select("script").forEach {
        Log.d(TAG, "Found script $it")
    }
    return document.body().text()
}
The above code identifies the scripts but does not attempt to execute them:
<body>
    <script src="runtime.js" type="module"></script>
    <script src="polyfills.js" type="module"></script>
    <script src="vendor.js" type="module"></script>
    <script src="main.js" type="module"></script>
</body>
Following a previous post, "How to use ScriptEngineManager in Android?", I added code to execute the scripts:
val engine = ScriptEngineManager().getEngineByName("rhino")
// note: type=module required to avoid the GoogleAnalytics script
document.select("script[type=module]").forEach {
    val script = Jsoup.parseBodyFragment(it.data())
    Log.d(TAG, "Attempting to execute script $it")
    val returnValue = engine.eval(script.html())
    Log.d(TAG, "Engine returns\n$returnValue")
}
Disappointingly, the output of the four scripts is:
<html>
<head/>
<body/>
</html>
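One observation that may explain this output, offered as a sketch rather than a confirmed diagnosis: each script tag above uses a src attribute and has no inline code, so Jsoup's Element.data() is empty for it, and parsing an empty string with Jsoup.parseBodyFragment() gives a document with an empty body. The example below checks both points on a stand-in page. SAMPLE_HTML, describeScripts and the base URL https://example.com/app/ are placeholders invented for illustration, not part of my app.

```kotlin
import org.jsoup.Jsoup

// Hypothetical stand-in for the downloaded page (not the real site).
val SAMPLE_HTML = """
<body>
<script src="runtime.js" type="module"></script>
<script src="main.js" type="module"></script>
</body>
"""

// For each module script, pair its inline source with its resolved src URL.
// baseUrl is needed so that relative src values can be made absolute.
fun describeScripts(html: String, baseUrl: String): List<Pair<String, String>> =
    Jsoup.parse(html, baseUrl)
        .select("script[type=module]")
        .map { it.data() to it.absUrl("src") }

fun main() {
    // The base URL is a placeholder assumption.
    describeScripts(SAMPLE_HTML, "https://example.com/app/").forEach { (inline, src) ->
        println("inline source length=${inline.length}, src=$src")
    }
    // Parsing an empty string yields a document skeleton with an empty body.
    println(Jsoup.parseBodyFragment("").body().children().isEmpty())
}
```

If that reading is right, the script code would first have to be downloaded from the src URLs, and even then the scripts likely expect a browser environment (a DOM) that Rhino alone does not supply.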
I am fresh out of ideas and appear to be no closer to obtaining the text shown on the webpage.
Can anyone suggest how to obtain the text?
For completeness, these are the dependencies used to support Jsoup and ScriptEngineManager():
implementation 'org.jsoup:jsoup:1.15.4'
implementation 'io.apisense:rhino-android:1.0'