
I'm building a script that uses a looping ImportHTML command to web scrape weather data based on zip code, and am currently running into an issue with the execution timing out every time the script is run.

The current way I have the script set up produces a correct result when run, but given that the script is pulling data from several hundred sources, it takes a while and will not complete within Google Apps Script's execution time limit.

The sheet running the script utilizes 3 tabs:

  1. ZIPS, which contains a list of zip code link values taken from the site the weather data is to be scraped from
  2. Blank, which is simply an intermediary sheet used in the execution of the script
  3. Result, where the final output is to be placed

To reduce the number of reads and writes as much as possible, I changed the code from writing each ImportHTML result as it was executed to appending the results to an array and writing that array once at the end of the script. The code in its current form is as follows:

function getTemps() {
  var googleSheet = SpreadsheetApp.getActive();

  // Read in zip code link values
  var sheet = googleSheet.getSheetByName('ZIPS');
  var zipArray = sheet.getDataRange().getValues();
  var arrayLength = zipArray.length;

  // Set up the two halves of the ImportHtml formula
  var blankSyntaxA = 'ImportHtml("https://www.wunderground.com/cgi-bin/findweather/getForecast?query=pz:';
  var blankSyntaxB = '&zip=1", "table", 1)';
  var tempResult = [];

  // Writing section
  var sheet = googleSheet.getSheetByName('Blank');

  for (var i = 0; i < arrayLength; i++) {
    // Build the formula for this zip prefix, set it, and read back the imported table
    var liveSyntax = blankSyntaxA + zipArray[i][0] + blankSyntaxB;
    sheet.getRange('A1').setFormula(liveSyntax);
    var importedData = sheet.getDataRange().getValues();
    tempResult = tempResult.concat(importedData);
  }

  var sheet = googleSheet.getSheetByName('Result');
  sheet.getRange(1, 1, tempResult.length, 8).setValues(tempResult);
}

I know the run time of the script could be reduced by eliminating the read/write operations inside the for loop, but I'm not sure how to obtain the necessary HTML table without running the ImportHTML command in the 'Blank' sheet. Is there a way to run that command to fill the 'importedData' array without writing to a sheet?

Alternatively, I had considered utilizing a check on the runtime of the function and implementing a break as it neared the ~5 minute runtime limit, followed by a recursive call back to the original function, but I wasn't sure if this would actually mitigate the runtime issue, or even be possible given the nature of the recursive call.
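That time-check idea can be sketched roughly as below. The budget constant and the `processRow` callback are illustrative assumptions, not part of the original script; in Apps Script the caller would save the returned index and schedule a continuation (e.g. with a trigger) when the budget runs out.

```javascript
// Sketch: stop the loop before hitting the Apps Script execution limit.
// MAX_RUNTIME_MS leaves a safety margin below the ~5-6 minute cap.
var MAX_RUNTIME_MS = 4.5 * 60 * 1000;

function processWithBudget(rows, startIndex, processRow) {
  var start = Date.now();
  var i = startIndex;
  while (i < rows.length) {
    if (Date.now() - start > MAX_RUNTIME_MS) {
      return i; // not finished: caller should resume from this index next run
    }
    processRow(rows[i]);
    i++;
  }
  return -1; // all rows processed
}
```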

Any advice on how this script could be modified to run within the script timeout parameter or modified to produce the complete desired outcome with all the necessary imported data would be appreciated. Thanks!

AndrewLYM
  • I could be wrong, but isn't the timeout generated because the service at https://www.wunderground.com is taking too long to load? I tried opening it and it hasn't responded for five minutes – Piyin Oct 18 '17 at 16:00
  • I haven't had any issues pulling up the service at wunderground.com in an internet browser, so not sure that it would be something on the service end. However, the biggest holdup to the script running does appear to be the time it takes for ImportHTML commands to load as the for loop is running. So I'm not sure if that would have something to do with the service side, or if that is simply the nature of running repeated command calls and read/writing the results in every iteration of the loop. – AndrewLYM Oct 18 '17 at 16:44
  • Isn't there a way to bring all the data you want with a single call to the service? (Probably passing multiple zip codes or something like that) – Piyin Oct 18 '17 at 16:58
  • I think that's what I'm currently stuck a bit on. The way the site is set up is a list of links for the first 3 digits of zip codes, found here: https://www.wunderground.com/weather-by-zip-code.asp Clicking into a link brings up an html table with weather data for locations matching the first 3 digits of a link, as such: https://www.wunderground.com/cgi-bin/findweather/getForecast?query=pz:228&zip=1 I'm hoping to pull this table, specifically the 'Place' and 'Temp.' columns, for each link, but I'm not sure how to accomplish this with one pull, or if that would even be possible. – AndrewLYM Oct 18 '17 at 18:24
  • But, are you trying to get the information for every zip code or you only need a few? ImportHTML works by receiving `url`, `type_of_html_structure` and `index_of_said_structure_on_the_code`, which allows you to take all available zip codes by calling `ImportHtml("https://www.wunderground.com/weather-by-zip-code.asp", "list", 18)`. Then, you could query each of those starting zips (padding them with zeros to complete the three digits) using your code. That would greatly reduce calls to the service – Piyin Oct 18 '17 at 19:53
  • This line is missing a double-quote at the end: var blankSyntaxA = 'ImportHtml("https://www.wunderground.com/cgi-bin/findweather/getForecast?query=pz:'; – NSchorr Jan 03 '18 at 04:39
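Piyin's suggestion of padding the prefixes with zeros could look something like the sketch below. The helper names are illustrative, and the URL shape is taken from the question:

```javascript
// Pad a zip prefix to three digits, e.g. 7 -> "007", 82 -> "082".
function padPrefix(prefix) {
  var s = String(prefix);
  while (s.length < 3) {
    s = '0' + s;
  }
  return s;
}

// Build the forecast query URL for one padded prefix.
function buildQueryUrl(prefix) {
  return 'https://www.wunderground.com/cgi-bin/findweather/getForecast?query=pz:' +
         padPrefix(prefix) + '&zip=1';
}
```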

1 Answer


Is there a way to run that command to fill the 'importedData' array without writing to a sheet?

IMPORTHTML is a Google Sheets built-in spreadsheet function. Built-in functions of this kind can't be run or evaluated by Google Apps Script.

Related

Alternatively, I had considered utilizing a check on the runtime of the function and implementing a break as it neared the ~5 minute runtime limit, followed by a recursive call back to the original function, but I wasn't sure if this would actually mitigate the runtime issue, or even be possible given the nature of the recursive call.

Rather than a mitigation, this is a common workaround. There are several techniques for it, such as batch processing and parallel processing.
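The batch-processing idea can be sketched as splitting the zip list into chunks and handling one chunk per execution, saving progress between runs. The chunk size is an assumption; in Apps Script, the index of the next chunk would typically be stored with `PropertiesService` and the next run started by a time-based trigger.

```javascript
// Split an array into chunks of a given size, e.g. one chunk per script run.
function chunk(items, size) {
  var chunks = [];
  for (var i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}
```

Each execution would then process `chunk(zipArray, 50)[n]` (50 is an illustrative size), append its results to the Result sheet, and record `n + 1` for the next run.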

Reference

From an answer to Threading in Google App Script:

There is a great example from Bruce Mcpherson. His example Parallel Processing in Apps Script uses Map Reduce. He utilizes triggers as well, but it may provide a different perspective.

Another alternative is to sign in to the Early Access Program to extend the execution time limit to 30 minutes.

Rubén