
I'm trying to create an automated process using Google-apps-script for scraping price data from pages like this one:

https://www.barchart.com/stocks/quotes/$AVVN/price-history/historical

The challenging part is that the data on the web page is 'lazy loaded', so the 'traditional' scraping methods that I have used on other web pages don't work here.

I have considered other ways of solving this problem - but:

  • Barchart does not provide data for e.g. $AVVN via http://marketdata.websol.barchart.com/getHistory
  • I don't want to use the 'Download' button, as this would require an automated login.
  • ImportXML() does not work (it works for other tables on the web page, but not for the one I want).

I found a similar problem in the following post, which received a very detailed and informative reply from omegastripes: Open webpage, select all, copy into sheet.

But when I run my code:

function test(){
  var url = 'https://www.barchart.com/proxies/core-api/v1/historical/get?symbol=%24AVVN&fields=tradeTime.format(m%2Fd%2Fy)%2CopenPrice%2ChighPrice%2ClowPrice%2ClastPrice%2CpriceChange%2Cvolume%2CsymbolCode%2CsymbolType&startDate=2019-04-15&endDate=2019-07-15&type=eod&orderBy=tradeTime&orderDir=desc&limit=2000&meta=field.shortName%2Cfield.type%2Cfield.description&raw=1'; 
  var options = {
     "muteHttpExceptions": false
  };
  var response   = UrlFetchApp.fetch(url, options);   
  Logger.log(response);
}

then I get the following error:

Request failed for https://www.barchart.com/proxies/core-api/v1/historical/get?symbol=%24AVVN&fields=tradeTime.format(m%2Fd%2Fy)%2CopenPrice%2ChighPrice%2ClowPrice%2ClastPrice%2CpriceChange%2Cvolume%2CsymbolCode%2CsymbolType&startDate=2019-04-15&endDate=2019-07-15&type=eod&orderBy=tradeTime&orderDir=desc&limit=2000&meta=field.shortName%2Cfield.type%2Cfield.description&raw=1 returned code 500. Truncated server response: <!doctype html> <html itemscope itemtype="http://schema.org/WebPage" lang="en"> <head> <meta charset="UTF-8" /> <meta name="viewport" content="wi... (use muteHttpExceptions option to examine full response) (line 57, file "DS#1")

Pasting the address into a browser gives basically the same thing: "Oops, something's wrong. Our apologies ... there seems to be a problem with this page."

So my question is: How can data be scraped from this page, or has Barchart now successfully blocked this scraping option?
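As an aside, the long hand-encoded URL is easier to read and tweak if it is assembled from its parts. A sketch, using the same parameter names and values as in the URL above (`encodeURIComponent` produces the same percent-encoding):

```javascript
// Build the historical-data request URL from its parts instead of
// hand-encoding it. Parameter names/values are those from the URL above.
function buildHistoryUrl(symbol, startDate, endDate) {
  var base = 'https://www.barchart.com/proxies/core-api/v1/historical/get';
  var params = {
    symbol: symbol,
    fields: 'tradeTime.format(m/d/y),openPrice,highPrice,lowPrice,' +
            'lastPrice,priceChange,volume,symbolCode,symbolType',
    startDate: startDate,
    endDate: endDate,
    type: 'eod',
    orderBy: 'tradeTime',
    orderDir: 'desc',
    limit: '2000',
    meta: 'field.shortName,field.type,field.description',
    raw: '1'
  };
  // encodeURIComponent turns '$' into %24, ',' into %2C and '/' into %2F,
  // matching the hand-encoded URL.
  var query = Object.keys(params).map(function (k) {
    return encodeURIComponent(k) + '=' + encodeURIComponent(params[k]);
  }).join('&');
  return base + '?' + query;
}
```

Calling `buildHistoryUrl('$AVVN', '2019-04-15', '2019-07-15')` reproduces the URL used in the code above.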

user2628114

2 Answers


The only way I found to obtain the data was using your workaround, getting the request URL to fetch from the console, but additionally you have to add the "x-xsrf-token" and "cookie" headers to the options when using the fetch() method [1].

You can get the "x-xsrf-token" and "cookie" request headers from the console as well. The only problem is that the cookies and xsrf-token are valid for up to 2 hours, because they implemented cross-site request forgery protection [2]:

(Screenshot: the Request Headers section in the browser developer tools' Network tab, showing the cookie and x-xsrf-token values.)

Here is the code I tested, which worked:

function testFunction() {
  var url = 'https://www.barchart.com/proxies/core-api/v1/historical/get?symbol=%24AVVN&fields=tradeTime.format(m%2Fd%2Fy)%2CopenPrice%2ChighPrice%2ClowPrice%2ClastPrice%2CpriceChange%2Cvolume%2CsymbolCode%2CsymbolType&startDate=2019-04-16&endDate=2019-07-16&type=eod&orderBy=tradeTime&orderDir=desc&limit=2000&meta=field.shortName%2Cfield.type%2Cfield.description&raw=1';

  var map = {
    "x-xsrf-token": "XXXXX", // value copied from the request headers
    "cookie": "XXXXX"        // value copied from the request headers
  };

  var options = {
     "method": "get", 
     "muteHttpExceptions": false,
     "headers": map
  };
  var response = UrlFetchApp.fetch(url, options);   
  Logger.log(response);

  var json = JSON.parse(response.getContentText());
  Logger.log(json.data[0]);
}

[1] https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app

[2] Difference between CSRF and X-CSRF-Token

Andres Duarte
  • Thx Andres for taking the time and effort to reply to my post. When I paste the current 'cookie' and 'x-xsrf-token' into the code you provided, it works very well. Via response.getAllHeaders()['Set-Cookie'] I'm only able to retrieve info about 'market', 'xsrf-token' and 'laravel_session', but these are the Response Headers. I need the 'cookie' and 'x-xsrf-token' from the Request Headers (as shown in your screenshot). So I'm stuck on that part of the solution (although it seems to work now and then by using the extracted info from the Response Headers – very strange(??)). – user2628114 Jul 18 '19 at 09:41
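Regarding the comment above: the values in the Set-Cookie response headers can, in principle, be assembled into the request headers the API expects. A minimal sketch of a hypothetical helper (the helper name is made up; the cookie names are the ones mentioned above, and whether the token must be URL-decoded before sending is an assumption):

```javascript
// Sketch: turn an array of "Set-Cookie" response-header values into the
// "cookie" and "x-xsrf-token" request-header values. Hypothetical helper;
// assumes the XSRF token is sent URL-decoded in the x-xsrf-token header.
function buildRequestHeaders(setCookieValues) {
  // Keep only the "name=value" part of each cookie, dropping attributes
  // such as "path=/" or "httponly".
  var pairs = setCookieValues.map(function (sc) {
    return sc.split(';')[0];
  });
  var xsrf = '';
  pairs.forEach(function (p) {
    if (p.indexOf('XSRF-TOKEN=') === 0) {
      // The token value is URL-encoded inside the cookie.
      xsrf = decodeURIComponent(p.substring('XSRF-TOKEN='.length));
    }
  });
  return {
    'cookie': pairs.join('; '),
    'x-xsrf-token': xsrf
  };
}
```

The result could then be passed as the `headers` option to `UrlFetchApp.fetch()`, in place of the hand-copied values.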

You cannot scrape lazy-loaded web pages with Google Apps Script, because UrlFetchApp only returns the initial HTML content of the page and does not wait for the JavaScript to run before returning the content.

One possible solution is to use Google Cloud Functions with Puppeteer to load the page. The Cloud Function provides an HTTP API that can be invoked directly from Apps Script with the URLFetch service.

Amit Agarwal