I want to load a web page that's generated by JS (e.g., AngularJS or similar) then scrape it using (only) Google Apps Script. How can I accomplish that?

I'm looking for something like:

const response = UrlFetchApp.fetch( urlToExternalJsPage );
const content = response.getContentText();
// scrape content

Only, maybe, replace the UrlFetchApp call with some call to a library or something? Perhaps a Puppeteer library for GAS, the Cheerio library for GAS, or something else?

How can I load a page whose HTML is generated by externally loaded JS, and then read the HTML after it has been generated in order to scrape it?
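
To make the problem concrete, here is a minimal sketch of a check for whether a fetched response is only a client-side shell, meaning it has almost no visible text once the script tags are stripped out. The helper name `looksClientRendered`, the 50-character threshold, and both sample HTML strings are hypothetical and exist only for illustration; none of them come from Apps Script or any library, and the regex-based tag stripping is a rough heuristic rather than a real HTML parser.

```javascript
// Hypothetical helper (not part of Apps Script): guesses whether fetched HTML
// is only a client-side shell by measuring the visible text left in <body>
// after scripts, styles, comments and tags are removed.
function looksClientRendered(html, minTextLength) {
  // Assumed default threshold of 50 characters; tune for the target site.
  var threshold = minTextLength === undefined ? 50 : minTextLength;
  var bodyMatch = /<body[^>]*>([\s\S]*?)<\/body>/i.exec(html);
  var body = bodyMatch ? bodyMatch[1] : html;
  var visibleText = body
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<!--[\s\S]*?-->/g, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return visibleText.length < threshold;
}

// In Apps Script the input would come from the fetch shown above:
//   var html = UrlFetchApp.fetch(urlToExternalJsPage).getContentText();
// Made-up sample responses, used here so the sketch runs on its own.
var shell = '<html><body><div ng-app="demo"><div ng-view></div></div>' +
            '<script src="app.js"></script></body></html>';
var rendered = '<html><body><h1>Product list</h1><p>' +
               'Twelve items are currently in stock and ready to ship.</p></body></html>';

var shellIsClientRendered = looksClientRendered(shell);
var renderedIsClientRendered = looksClientRendered(rendered);
```

With these samples the first string is flagged as a shell and the second is not. A check like this only detects the situation; it does not execute the page's JS, which is the part still missing.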

Idea 1

I came across this article, "The Best Way to Load Javascript", which supplies the following code.

function loadScript(url, callback){
  var script = document.createElement("script");
  script.type = "text/javascript";
  if (script.readyState){  //IE
    script.onreadystatechange = function(){
      if (script.readyState == "loaded" || script.readyState == "complete"){
        script.onreadystatechange = null;
        callback();
      }
    };
  } else {  //Others
    script.onload = function(){
      callback();
    };
  }
  script.src = url;
  document.getElementsByTagName("head")[0].appendChild(script);
}

The actual code on your page ends up looking like this:

<script type="text/javascript" src="http://your.cdn.com/first.js"></script>
<script type="text/javascript">
  loadScript("http://your.cdn.com/second.js", function(){
    //initialization code
  });
</script>

The problem with this approach is that I'm trying to stay strictly server side. I'm not trying to post any HTML pages and/or serve them.

Idea 2

I came across this article, which appears to describe a Puppeteer library for GAS. I translated it from Japanese using Google Translate. The problem is that it requires Google Cloud Platform, which I want to avoid. I also want to avoid setting up any billing and stay strictly inside Google Apps Script.

Idea 3

Perhaps there is a way to use the browser that comes with the UI service, specifically the sidebar?

On this page, I found the following example of importing web pages into an HTML service page using an IFRAME.

Code.gs
function doGet() {
  var template = HtmlService.createTemplateFromFile('top');
  return template.evaluate();
}

top.html
<!DOCTYPE html>
<html>
 <body>
   <div>
     <a href="http://google.com" target="_top">Click Me!</a>
   </div>
 </body>
</html>
Let Me Tink About It
  • Related: https://stackoverflow.com/q/61579707 https://stackoverflow.com/a/61928025 https://stackoverflow.com/a/50856901 – TheMaster May 28 '20 at 07:08
  • Idea 3 won't work. Script tags are rendered client side. Why are you asking questions like "perhaps, this will work?" You already have the code. Why don't you test it and let us know? Explain the error and show that it doesn't work with a [mcve]. Anyway, what's wrong with Cheerio? You mentioned in another answer that it works. – TheMaster May 28 '20 at 20:25
  • @TheMaster: In my experience, CheerioGS works fine for what it's designed to do, which is to parse HTML on the server side with syntax familiar from jQuery, which otherwise only works on the client side. Cheerio does not, however, solve the problem I am posing here, which is how to get the HTML in the first place. In some AJAX pages, the HTML is created by JS that executes only in the browser, after load and after certain browser events. Web pages built that way make it more difficult to acquire the HTML because they return only the JS in response to, say, a call to `UrlFetchApp.fetch(url)` – Let Me Tink About It May 28 '20 at 22:47
  • @TheMaster: I say "perhaps this will work" because I am trying different solutions one at a time while I am also looking at solutions outside of GAS. If someone has already tried something I'm about to try, it would be useful to learn about it. – Let Me Tink About It May 28 '20 at 22:51

0 Answers