I want to load a web page that's generated by JS (e.g., AngularJS or similar) then scrape it using (only) Google Apps Script. How can I accomplish that?
I'm looking for something like:
const response = UrlFetchApp.fetch( urlToExternalJsPage );
const content = response.getContentText();
// scrape content
Only, maybe, replace the UrlFetchApp call with some call to a library or something? Perhaps a Puppeteer library for GAS, the Cheerio library for GAS, or something else?
How can I load an external, JS-rendered page and read its HTML after the page has been generated, so that I can scrape it?
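To illustrate the problem, here is a minimal sketch of what I believe happens with a plain fetch: the response contains the unrendered template rather than the generated HTML. The helper name looksUnrendered and both sample strings are made up for illustration, and the check is only a rough heuristic, since not every AngularJS page uses {{ }} interpolation.

```javascript
// Hypothetical helper (not part of any GAS API): guesses whether fetched HTML
// still contains unrendered {{ ... }} template markers.
function looksUnrendered(html) {
  return /\{\{[^{}]*\}\}/.test(html);
}

// Made-up sample of what a plain HTTP fetch might return for an AngularJS page:
// the template is still there because no browser has executed the page's scripts.
const sampleStaticHtml =
  '<html ng-app="demo"><body><ul>' +
  '<li ng-repeat="item in items">{{ item.name }}</li>' +
  '</ul></body></html>';

// Made-up sample of the same page after a browser has run the scripts.
const sampleRenderedHtml =
  '<html ng-app="demo"><body><ul>' +
  '<li ng-repeat="item in items">Alpha</li>' +
  '<li ng-repeat="item in items">Beta</li>' +
  '</ul></body></html>';

const staticLooksUnrendered = looksUnrendered(sampleStaticHtml);
const renderedLooksUnrendered = looksUnrendered(sampleRenderedHtml);
```

In Apps Script, I would pass response.getContentText() to the helper instead of the sample strings. What I'm after is a way to get the second kind of HTML.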
Idea 1
I came across the article "The Best Way to Load Javascript", which supplies the following code:
function loadScript(url, callback){
    var script = document.createElement("script");
    script.type = "text/javascript";
    if (script.readyState){ //IE
        script.onreadystatechange = function(){
            if (script.readyState == "loaded" || script.readyState == "complete"){
                script.onreadystatechange = null;
                callback();
            }
        };
    } else { //Others
        script.onload = function(){
            callback();
        };
    }
    script.src = url;
    document.getElementsByTagName("head")[0].appendChild(script);
}
The actual code on your page ends up looking like this:
<script type="text/javascript" src="http://your.cdn.com/first.js"></script>
<script type="text/javascript">
    loadScript("http://your.cdn.com/second.js", function(){
        //initialization code
    });
</script>
The problem with this approach is that it runs in a browser, and I'm trying to stay strictly server-side. I'm not trying to create or serve any HTML pages.
Idea 2
I came across this article, which appears to describe a Puppeteer library for GAS. I translated it from Japanese using Google Translate. The problem is that it requires Google Cloud Platform, which I want to avoid. I also want to avoid setting up any billing and stay strictly inside Google Apps Script.
Idea 3
Perhaps there is a way to use the browser that comes with the UI service, specifically the sidebar?
On this page, I found the following example of importing web pages into an HTML service page using an IFRAME.
function doGet() {
  var template = HtmlService.createTemplateFromFile('top');
  return template.evaluate();
}
top.html
<!DOCTYPE html>
<html>
  <body>
    <div>
      <a href="http://google.com" target="_top">Click Me!</a>
    </div>
  </body>
</html>