0

I've created a script to scrape two fields from a webpage using google apps script. The script seems to be working properly but it writes results in a spreadsheet on a per line basis.

What I did is grab the links of all the shops from landing page and then reuse the links within another function to parse shop name and website link from innerpages. Although the two fields are also available in landing page, I wish to parse them from their innerpages.

However, I wish to write the result in the spreadsheet doing batch update. How can I achieve that?

Current approach:

function parseYellowpages() {
  var options = {
    "method" : "GET",
    "headers" : {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like  Gecko) Chrome/88.0.4324.150 Safari/537.36"
     }
    };
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var activeSheet = ss.getSheetByName('Sheet1');
  var base = "https://www.yellowpages.com"
  var webURL = "https://www.yellowpages.com/search?search_terms=pizza&geo_location_terms=New+York%2C+NY";

  var row = 1;

  var response = UrlFetchApp.fetch(webURL,options);
  var $ = Cheerio.load(response.getContentText()); 

  const items = $("[class='result'] [class='info']");
  for (i=0;i<items.length;i++){
      var shopLink = base + $(items[i]).find('h2[class="n"] > a.business-name').attr('href');
      var resultContainer;
      resultContainer = getInnerpageInfo(shopLink,options);
      activeSheet.getRange(row,1).setValue(resultContainer[0]);
      activeSheet.getRange(row,2).setValue(resultContainer[1]);
      console.log(resultContainer[0],resultContainer[1]);
      row++;
    };
}

function getInnerpageInfo(innerLink,options) {
    var response = UrlFetchApp.fetch(innerLink,options);
    var $ = Cheerio.load(response.getContentText()); 
    var shopName = $('.sales-info > h1').first().text();
    var website = $('a.website-link').first().attr('href');
    return [shopName,website];

}

How to write result to a spreadsheet doing batch update?

MITHU
  • 113
  • 3
  • 12
  • 41
  • I'm worried that in the Stackoverflow, when the continuous requests are done, an error occurs. So in your situation, if your expected values can be retrieved by Stack Exchange API, I would like to recommend using it. [Ref](https://api.stackexchange.com/docs) But, if you cannot use the API, I apologize. – Tanaike Dec 21 '21 at 11:55
  • Thanks for your comment @Tanaike. The site link that I used in my above script is just a placeholder. I wanted to know the logic of achieving that. However, I picked that site link intentionally thinking that it would be harmless. – MITHU Dec 21 '21 at 12:28
  • 2
    @MITHU Use `example.com` as placeholder. Don't use valid links as placeholder. Your function name also suggests that it's the site you're trying to parse. – TheMaster Dec 21 '21 at 14:09
  • Okay, I've changed the site link within the script now. – MITHU Dec 21 '21 at 14:42
  • The question looks to need more focus. Are you looking for help with fixing the problem to write to the spreadsheet or do you want help to help with batchUpdate? – Rubén Dec 24 '21 at 22:59
  • It's batch update @Rubén. – MITHU Dec 25 '21 at 04:27
  • Duplicate of https://stackoverflow.com/questions/41175326/why-do-we-use-spreadsheetapp-flush – TheMaster Dec 26 '21 at 00:57
  • 2
    Please [edit] the question to make it clear what you are asking, starting by editing the title and by showing what you tried to used batch update. – Rubén Dec 26 '21 at 03:33
  • Check out the edit @Rubén. I think the question that I've posed is in english now. – MITHU Dec 26 '21 at 07:56
  • The title says "Trouble writing result to a spreadsheet using batch update" but the body doesn't show the trouble and the code shown doesn't use batchUpdate. If you haven't done yet, please read https://developers.google.com/apps-script/advanced/sheets. P.S. as the answer already has several upvotes I suggest you to post a new question about `batchUpdate` including a [mcve] showing your attempt to use it. – Rubén Dec 26 '21 at 18:03
  • So, you measure an answer with a number of upvotes, without considering whether the proposed solution works, right? If you think this is the very [output](https://imgur.com/7EqRWZZ) the script should produce, I have nothing more to say @Rubén. I think I'm done answering your questions. Btw, I'm aware when to ask a new question. – MITHU Dec 26 '21 at 22:06

1 Answers1

3

You can dramatically improve performance by replacing repeated .setValue() calls with one .setValues() call that writes all results to the spreadsheet in one batch.

To collect the results, push them into an array one by one in the loop, or replace the loop with Element.map(), like this:

  const result = items.map((index, item) => {
    const path = $(item).find('h2[class="n"] > a.business-name').attr('href');
    const shopLink = base + path;
    return [getInnerpageInfo(shopLink, options)];
  }).get();

To write the results, get a range whose size matches the dimensions of the results and use .setValues(), like this:

  activeSheet.getRange('A1')
    .offset(0, 0, result.length, result[0].length)
    .setValues(result);

The reason why the code in the question does not seem to be doing anything for some time is that Google Apps Script caches writes to the spreadsheet. In many cases, all the writes only get executed when the script terminates.

If you need to apply pending writes immediately, use SpreadsheetApp.flush().

Also see Apps Script best practices and UrlFetchApp.fetchAll().

doubleunary
  • 13,842
  • 3
  • 18
  • 51
  • The solution available [here](https://stackoverflow.com/a/70476851/9189799) addresses in the right way why your provided solution might fail. – MITHU Dec 26 '21 at 10:44
  • [The script](https://www.loom.com/share/67e20bea6f294ffc95dc8cf194f8808a) that is already in my post runs like this. Your suggested [version](https://www.loom.com/share/4f43ec4e43f14638906bf5e5609e3f6b) throws error in the first place. – MITHU Dec 26 '21 at 13:17
  • I've got success using this [code](https://pastebin.com/j9axZyhV). I'm not sure whether this is the right way as I'm very new to it. Your solution produces this [output](https://imgur.com/7EqRWZZ). Thanks. – MITHU Dec 26 '21 at 22:13
  • OK, I see. The reason for the trouble is that the signature of Cheerio (jQuery) `.map()` differs from that of `Array.map()`, and to get an array result, a `.get()` is required. Edited the answer. If you continue to have trouble with `.map()`, just use your new `.each()` solution — that pattern is perhaps not as declarative, but the performance will be essentially the same, as long as you use one `.setValues()` call instead of repeated `.setValue()` calls. Glad that you found a solution. – doubleunary Dec 27 '21 at 01:03
  • Now, your script seems to be working properly when I print the result `console.log(result);` However. it throws this [error](https://pastebin.com/UtSRC84e) pointing at this line `activeSheet.getRange('A1').offset`. I think I understand why it happened (the results are in a single list, not list of lists) but I don't know how to fix it. This is the [result](https://filebin.varnish-software.com/hrah9f66cxg1x6eh/output.txt) that I see in the console. – MITHU Dec 27 '21 at 04:05
  • You current implementation gets success If i try `return [[shopName,website]];` instead of `return [shopName,website];` within `getInnerpageInfo()`. – MITHU Dec 27 '21 at 04:24