
I have a Google Sheet with a custom function formula that takes in a matrix and two vectors from the spreadsheet, does some lengthy matrix-vector calculations (>30 sec, so above the quota), and outputs the result as a set of rows. It is single-threaded, since that's what Google Apps Script (GAS) natively is, but I want to parallelize the calculation using a multi-threading workaround, to speed it up drastically.

Requirements (1-3):

  1. UX: It should run the calculations automatically and reactively as a custom function formula, which implies that the user doesn't have to manually start it by clicking a run button or similar, just like my single-threaded version currently works.

  2. Parallelizable: It should ideally spawn ~30 threads/processes, so that instead of taking >30 seconds as it now does (which makes it time out due to Google's quota limit), it should take ~1 second. (I know GAS is single-threaded, but there are workarounds, referenced below).

  3. Shareability: I should ideally be able to share the Sheet with other people, so they can "Make a copy" of it, and the script will still run the calculations for them:

  • 3.1 Permissionless: Without me having to manually hand out individual permissions to users (permissionless). For instance, whenever someone "Makes a copy" and the web app is set to "Execute the app as user accessing the web app". My rudimentary testing suggests that this is possible.
  • 3.2 Non-intrusive: Without users of the spreadsheet having to give intrusive authorizations like "Give this spreadsheet/script/app access to your entire Google Drive or Gmail account?". Users having to give a non-intrusive authorization to a script/web app can be acceptable, as long as requirement 3.1 is still maintained.
  • 3.3 UX: Without forcing users to view a HTML sidebar in the spreadsheet.

I have already read this excellent related answer by @TheMaster which outlines some potential ways of solving parallelization in Google Apps Script in general. Workaround #3 (google.script.run) and workaround #4 (UrlFetchApp.fetchAll), both using a Google Web App, look most promising. But some details are unknown to me, such as whether they can adhere to requirements 1 and 3 with their sub-requirements.
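For concreteness, here is a minimal sketch of what workaround #4 could look like from the custom function's side. The Web App URL, the `splitIntoChunks` helper, and the request shape are my assumptions for illustration, not details from the linked answer:

```javascript
// Pure helper (runnable anywhere): build one fetchAll request object per chunk.
// The request-object shape follows UrlFetchApp.fetchAll's documented format.
function buildChunkRequests(webAppUrl, chunks) {
  return chunks.map(function (chunk) {
    return {
      url: webAppUrl,
      method: "post",
      contentType: "application/json",
      payload: JSON.stringify(chunk),
      muteHttpExceptions: true
    };
  });
}

// In Apps Script, a custom function could then fan out roughly like this
// (hypothetical names; not runnable outside Apps Script):
// function PARALLEL_CALC(matrix, vec1, vec2) {
//   const requests = buildChunkRequests(WEB_APP_URL, splitIntoChunks(matrix, vec1, vec2));
//   const responses = UrlFetchApp.fetchAll(requests); // HTTP calls go out concurrently
//   return responses.map(function (r) { return JSON.parse(r.getContentText()); });
// }
```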

I can conceive of another potential naïve workaround, which would be to split the function up into several custom function formulas and do the parallelization (by some kind of Map/Reduce) inside the spreadsheet itself (storing intermediary results back into the spreadsheet, and having custom function formulas work on that as reducers). But that's undesired, and probably infeasible, in my case.

I'm very confident my function is parallelizable using some kind of Map/Reduce process. The function is currently optimized by doing all the calculations in-memory, without touching the spreadsheet between steps, before finally outputting the result to the spreadsheet. The details of it are quite intricate and well over 100 lines, so I don't want to overload you with more (and potentially confusing) information which doesn't really affect the general applicability of this case. For the context of this question you may assume that my function is parallelizable (and map-reduce'able), or consider any function you already know that would be. What's interesting is what's generally possible to achieve with parallelization in Google Apps Script, while also maintaining the highest level of shareability and UX. I'll update this question with more details if needed.

Update 2020-06-19:

To be more clear, I do not rule out Google Web App workarounds entirely, as I don't have enough experience with their practical limitations to know for sure whether they can solve the problem within the requirements. I have updated the sub-requirements 3.1 and 3.2 to reflect this. I also added sub-requirement 3.3, to be clearer on the intent, and removed requirement 4, since it largely overlapped with requirement 1.

I also edited the question and removed the related sub-questions, so it is more focused on the single main HOWTO-question in the title. The requirements in my question should provide a clear objective standard for which answers would be considered best.

I realise the question might entail a search for the Holy Grail of Google Sheet multithreading workarounds, as @TheMaster has pointed out in private. Ideally, Google would provide one or more features to support multithreading, map-reduce, or more permissionless sharing. But until then I would really like to know what is the optimal workaround within the current constraints we have. I would hope this question is relevant to others as well, even considering the tight requirements.

Magne
  • I'm guessing that this is the wrong forum for you. Perhaps you could make this a feature request directly to Google. – Cooper Jun 18 '20 at 18:44
  • To do GAS multithreading you need to be using the web app functionality, not a custom function. In that case you use the web app side to break the job up into chunks and then spawn threads on the server side to do the work and send it back. If you need a solution in the spreadsheet, you could try pairing an onEdit(e) trigger that checks e.value for your "formula", parses out the variables, and then runs a single-threaded process that runs for 20 seconds, saves its state to a script property, and creates a trigger to continue from the saved state in a minute. – J. G. Jun 18 '20 at 18:49
  • If you want to know if that's possible - not really (unless you use a Web App as mentioned, but that's a workaround); after all, this is JavaScript without the web workers of the DOM or the processes of Node.js. Could we still see the code (since you mention the size - I think a gist will do just fine)? – Oleg Valter is with Ukraine Jun 18 '20 at 20:07
  • Time-driven self-rescheduling triggers are a good alternative, but they are notoriously unreliable; if you don't mind handling occasional random firing you can use them. Additionally, you will have to account for the maximum number of triggers per project. – Oleg Valter is with Ukraine Jun 18 '20 at 20:16
  • If your custom function is well optimized, you'd probably need a server to process the load. Ideally, if you publish a web app with "anyone, even anonymous", then the custom function can use fetchAll to post to that web app. This will run in parallel. The caveat here is that if multiple people use the sheet, the custom function will have to post to the same web app for processing, and Google will limit simultaneous executions. Workarounds: #1 ask people to publish their own web apps; #2 host a server for the load – TheMaster Jun 18 '20 at 22:05
  • Again, you might not need any of this, if the code can be optimized internally. – TheMaster Jun 18 '20 at 22:06
  • @J.G. "To do gas multithreading you need to be using the webapp functionality, not a custom function." → But a custom function needs to do the request/response to the webapp, no? – Magne Jun 19 '20 at 17:18
  • @J.G. and @OlegVa and @TheMaster: Thanks for your suggestions. The goal is not primarily to defeat the quota, but to provide as fast and efficient an execution as possible, to preserve the UX. I think multi-threading would be needed, as I am trying to brute-force an exponential function (which has a fixed upper bound, `x`), namely all possible variations of a fixed-size binary input vector with `2^x` possibilities. (I currently can't conceive of a more optimal algorithm, due to the function's erratic non-linearity. I've tried evolutionary optimisation, but it gave no guarantees). – Magne Jun 19 '20 at 17:31
  • Magne, thank you for responding. I asked for code because it might affect the suggestions, but you gave a clear explanation of the idea. Regarding a custom function's access to a call to the Web App - it can do that, as `UrlFetchApp` is one of the whitelisted services (although I would have to check how that correlates with the fact that it usually requires authorization). @TheMaster - maybe move your answer here as well and extend it somewhat - I can see the potential to build a canonical Q&A on parallelization in GAS here, what do you think? Overall - the conversation is getting lengthy, maybe a chat room? – Oleg Valter is with Ukraine Jun 20 '20 at 04:50
  • Also, `google.script.run` won't work inside the spreadsheet unless you use a dialog or a sidebar, which is what 3.3 says you want to avoid (and you probably already knew that). But `UrlFetchApp.fetch()` will - how about an `onEdit` trigger? As for the authorization, explicit scopes will help you granulate the permissions needed (+the only one would probably be "make external requests"). With the Web App workaround you must also be aware of quotas which, given your use case, may get exhausted. Finally - maybe worth posting the function code on Code Review? – Oleg Valter is with Ukraine Jun 20 '20 at 04:59
  • Oh, and another suggestion - move the function that does the heavy lifting to Cloud Functions, set up an HTTP trigger, and call it with `UrlFetchApp.fetch()`. This way, you can use Node.js to fork processes *and* be able to use the custom function. – Oleg Valter is with Ukraine Jun 20 '20 at 08:53
  • @OlegValter Converted my comment to an answer. Custom functions don't need authorization to access external APIs. – TheMaster Jun 20 '20 at 08:53
  • @TheMaster - yep, saw that, thanks! Curious why you used wiki-style citation, unusual to see (could you proofread the answer a bit, though?). Re: custom functions - I know, I probably poorly worded my comment above. The note about explicit scopes was for the other workarounds, not for custom functions; they indeed can request external APIs without authorization. – Oleg Valter is with Ukraine Jun 20 '20 at 08:55
  • @Magne Although not recommended, depending on the function workflow, you may also be able to split the custom function with spreadsheet functions: `=JOIN(FUNC(1),FUNC(2))`, where FUNC is a custom function. – TheMaster Jun 20 '20 at 09:00

2 Answers


If you publish a web app with access set to "anyone, even anonymous" and execution set to "Me", then the custom function can use UrlFetchApp.fetchAll (authorization is not needed) to post to that web app. This will run in parallel. This solves all three requirements.
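A minimal sketch of the Web App side, under my own assumptions: the `calculateChunk` placeholder stands in for the real calculation, and only the pure part is runnable outside Apps Script:

```javascript
// Placeholder worker: stands in for the real matrix-vector calculation.
// Pure JavaScript, so it can be tested outside Apps Script.
function calculateChunk(payload) {
  // Hypothetical result: report how many combinations this chunk covers.
  return { start: payload.start, end: payload.end, result: payload.end - payload.start + 1 };
}

// Apps Script entry point the published web app would expose (commented out
// because ContentService only exists inside Apps Script):
// function doPost(e) {
//   const payload = JSON.parse(e.postData.contents);
//   return ContentService
//     .createTextOutput(JSON.stringify(calculateChunk(payload)))
//     .setMimeType(ContentService.MimeType.JSON);
// }
```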

The caveat here: if multiple people use the sheet, the custom function will have to post to the "same" web app (the one you published to execute as you) for processing, and Google will limit simultaneous executions (quota limit: 30).

To work around this, you can ask people using your sheet to publish their own web apps. They'll have to do this once at the beginning, and no authorization is needed.

If not, you'll need to host a custom server to handle the load, or something similar might help.

TheMaster
  • Thanks for your answer and suggestion. I actually ended up using my naïve workaround, since it stayed more within the confines of vanilla Google Sheets. I reflected a bit on it, and figured it was actually feasible. When looking at the available options and their drawbacks, the naïve approach didn't seem that bad after all. See my answer for the details. Since I initially ruled out that approach as undesired in the original question, I am happy to accept your answer as the correct one. Let me know. Thanks a bunch for looking at it anyway. – Magne Jul 01 '20 at 12:30
  • @Magne Good to know. You can accept your own answer, as it worked out for you. I believe the main caveat with spreadsheet + custom functions, as said in your answer, would be individual custom function failures. Although it might seem robust now, error rates might increase. I may be wrong though. – TheMaster Jul 01 '20 at 14:25

I ended up using the naïve workaround that I mentioned in my post:

I can conceive of an other potential naïve workaround which would be to split the function up into several custom functions formulas and do the parallelization (by some kind of Map/Reduce) inside the spreadsheet itself (storing intermediary results back into the spreadsheet, and having custom function formulas work on that as reducers). But that's undesired, and probably unfeasible, in my case.

I initially disregarded it because it involves having an extra sheet tab with calculations, which was not ideal. But when I reflected on it after investigating alternative solutions, I realised it actually solves all the stated requirements in the most non-intrusive manner, since it doesn't require anything extra from the users the spreadsheet is shared with. It also stays 'within' Google Sheets as far as possible (no semi- or fully external Web App needed), doing the parallelization by relying on the native concurrent execution of spreadsheet cells, where results can be chained, and appearing to the user like regular formulas (no extra menu item or run-this-script buttons necessary).

So I implemented MapReduce in Google Sheets using custom functions, each operating on a slice of the interval I wanted to calculate. The reason I was able to do that, in my case, was that the input to my calculation was divisible into intervals that could each be calculated separately and then joined later.**

Each parallel custom function then takes in one interval, calculates it, and outputs the results back to the sheet (I recommend outputting as rows instead of columns, since a sheet is capped at 18,278 columns. See this excellent post on Google Spreadsheet limitations.) I did run into the 40,000-new-rows-at-a-time limitation, but was able to perform some reducing within each interval, so that each one only output a very limited number of rows to the spreadsheet. That was the parallelization; the Map part of MapReduce. Then I had a separate custom function which did the Reduce part, namely: dynamically target*** the spreadsheet output areas of the separately calculated custom functions, take in their results once available, and join them together while further reducing them (to find the best-performing results), to return the final result.
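To illustrate the Reduce step, here is a plain-JavaScript sketch. The [candidate, score] row shape is my assumption for illustration; the real reducer reads each interval's rows from the sheet via the dynamically determined ranges:

```javascript
// Join the Map outputs from all intervals and keep only the best-scoring rows.
// Each interval's output is assumed to be an array of [candidate, score] rows.
function reduceIntervals(intervalOutputs, keep) {
  return intervalOutputs
    .reduce(function (all, rows) { return all.concat(rows); }, []) // join all intervals
    .sort(function (a, b) { return b[1] - a[1]; })                 // best-scoring first
    .slice(0, keep);                                               // further reduce
}
```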

The interesting part was that I thought I would hit the 30-simultaneous-executions quota limit of Google Sheets. But I was able to run up to 64 independently and seemingly concurrently executing custom functions. It may be that Google puts these into a queue if they exceed 30 concurrent executions, and only actually processes 30 of them at any given time (please comment if you know). In any case, the parallelization speedup was huge, and seemingly almost infinitely scalable. But with some caveats:

  1. You have to define the number of parallelised custom functions up front, manually. So the parallelization doesn't infinitely auto-scale according to demand****. This is important because of the counter-intuitive result that in some cases using less parallelization actually executes faster. In my case, the result set from a very small interval could be exceedingly large, while if the interval had been larger then a lot of the results would have been ruled out along the way by the algorithm in that parallelised custom function (i.e. the Map also did some reduction).

  2. In rare cases (with huge inputs), the Reducer function will output a result before all of the parallel (Map) functions have completed (since some of them take too long). So you seemingly have a complete result set, but then a few seconds later it re-updates when the last parallel function returns its result. This is not ideal, so to be notified of this I implemented a function to tell me whether the result was valid. I put it in the cell above the Reduce function (and colored the text red). B6 is the number of intervals (here 4), and the other cell references point to the cell with the custom function for each interval: =didAnyExecutedIntervalFail($B$6,S13,AB13,AK13,AT13)

    function didAnyExecutedIntervalFail(intervalsExecuted, ...intervalOutputs) {
      const errorValues = new Set(["#NULL!", "#DIV/0!", "#VALUE!", "#REF!", "#NAME?", "#NUM!", "#N/A", "#ERROR!", "#"]);
      // Check only the outputs of the intervals included in the parallel execution.
      for (let i = 0; i < intervalsExecuted; i++) {
        if (errorValues.has(intervalOutputs[i])) {
          return "Result below is not valid (due to errors in one or more of the intervals), even though it looks like a proper result!";
        }
      }
      // Implicitly returns undefined (a blank cell) when all executed intervals succeeded.
    }
  3. The parallel custom functions are limited by Google's quota of max 30 sec execution time for any custom function. So if they take too long to calculate, they may still time out (causing the issue mentioned in the previous point). The way to alleviate this timeout is to parallelise more, dividing into more intervals, so that each parallel custom function runs for less than 30 seconds.

  4. The output of it all is limited by Google Sheets limitations. Specifically, max 5M cells in a spreadsheet. So you may need to reduce the size of the results calculated in each parallel custom function before returning them to the spreadsheet, so that each is below 40,000 rows (otherwise you'll receive the dreaded "Results too large" error). Furthermore, depending on the size of the result of each parallel custom function, this also limits how many custom functions you can have at the same time, as they and their result cells take space in the spreadsheet. But if each of them takes, say, 50 cells in total (including a very small output), then you could still parallelize quite a lot (5M / 50 = 100,000 parallel functions) within a single sheet. But you also need some space for whatever you want to do with those results. And the 5M-cell limit applies to the whole spreadsheet in total, not just one of its sheet tabs, apparently.

** For those interested: I basically wanted to calculate all combinations of a sequence of bits (by brute force), so the function was 2^n, where n was the number of bits. The initial range of combinations was from 1 to 2^n, so it could be divided into intervals of combinations. For example, if dividing into two intervals, one would go from 1 to X and the other from X+1 to 2^n.
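The interval split described in this footnote can be sketched as a small helper (my own illustration, assuming the range 1..2^n is divided into roughly equal parts, one per parallel custom function):

```javascript
// Split the combination range 1..2^n into numIntervals [from, to] intervals.
function splitIntoIntervals(nBits, numIntervals) {
  const total = Math.pow(2, nBits); // total number of bit combinations
  const size = Math.ceil(total / numIntervals); // combinations per interval
  const intervals = [];
  for (let from = 1; from <= total; from += size) {
    intervals.push([from, Math.min(from + size - 1, total)]);
  }
  return intervals;
}
```

For example, splitIntoIntervals(3, 2) yields [[1, 4], [5, 8]]: two intervals covering the 2^3 = 8 combinations.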

*** For those interested: I used a separate sheet formula to dynamically determine the range for the output of each interval, based on the presence of rows with content. It was in a separate cell for each interval. For the first interval it was in cell S11, and the formula looked like this: =ADDRESS(ROW(S13),COLUMN(S13),4)&":"&ADDRESS(COUNTA(S13:S)+ROWS(S1:S12),COLUMN(Z13),4) and it would output S13:Z15, the dynamically calculated output range, which only counts rows with content (using COUNTA(S13:S)), thus avoiding a statically determined range. With a normal static range, the size of the output would have to be known in advance (which it wasn't), or the range would possibly either not include all of the output or include a lot of empty rows (and you don't want the Reducer to iterate over a lot of essentially empty data structures). Then I would input that range into the Reduce function by using INDIRECT(S$11). That's how you get the results, from one of the intervals processed by a parallelized custom function, into the main Reducer function.

**** Though you could make it auto-scale up to some pre-defined number of parallelised custom functions. You could use some preconfigured thresholds, and divide into, say, 16 intervals in some cases, but in other cases automatically divide into 64 intervals (preconfigured, based on experience). You'd then just stop / short-circuit the custom functions which shouldn't participate, based on whether the number of that parallelised custom function exceeds the number of intervals you want to divide into and process, with this on the first line of the parallelised custom function: if (calcIntervalNr > intervals) return;. You would still have to set up all the parallel custom functions in advance, which can be tedious (remember you have to account for the output area of each, and are limited by the 5M-cell limit of Google Sheets).

Magne
  • Hm, interesting solution, Magne - have you thought about Cloud Functions, btw? It would also avoid extra steps on users' side and solve the parallelization problem non-intrusively (that said, if you prefer to stay in context of GSheets - that is probably the best that can be done) – Oleg Valter is with Ukraine Jul 02 '20 at 16:49
  • @OlegValter Thanks. I didn't look into Cloud Functions. I presumed it would be more work, and add the complexity of handling a new domain/platform. I presume they would work similarly to a Google Web App, in the sense that the spreadsheet script would have to make external calls. If so, could it be done without users having to know about it and/or give extra permissions? And without me having to continuously run and pay for a deployment? – Magne Jul 06 '20 at 13:10
  • thank you for responding, will try to go point by point: 1. Permissions - no extra (unless you count an external request as one), if you make the CF publicly accessible or use a service account; 2. Payment - only if you exceed the [Free Tier](https://cloud.google.com/functions/pricing); that needs to be calculated, but if you do not exceed the quotas on custom functions you would probably not exceed the limits here; 3. Complexity - well, given what your project strives to achieve, I am not sure if a whole sheet acting as a reducer is less complex than basically a service... – Oleg Valter is with Ukraine Jul 06 '20 at 13:45
  • ...4. Continuously running - since the function would be HTTP-triggered, it would only count as "running" as long as the client request is open. If you fork processes to parallelize the mapper part, CPU time could be greatly reduced (I assume the calcs are CPU-heavy, so you would likely need to carefully consider tradeoffs). The downside of all this is, of course, that in this setup you bear the burden of each user invocation. On the bright side, you can introduce a lock or a queue to limit invocations to a reasonable amount... – Oleg Valter is with Ukraine Jul 06 '20 at 13:48
  • ...5. Amount of setup - depends on how ignorant you expect your users to be - writing and deploying a simple HTTP-triggered CF is a matter of minutes, and you already have the codebase that calculates the desired result and even a solution that splits the task into chunks - only instead of distributing the chunks between custom functions you would distribute them between child processes (or even instances of CF). – Oleg Valter is with Ukraine Jul 06 '20 at 13:55
  • @OlegValter Thanks for being very candid! 1 and 2. Seems awesome! 3. I should have specified that by complexity I was primarily concerned with the complexity of me having to learn a new domain/platform. I can still remember the daunting task of having to deal with the crowded AWS interface (littered with 'sociolect' / corp-speak / Amazon-specific product and domain naming) when fiddling around with Lambdas. 4. With the lenient free-tier quota, I would probably be able to bear the burden of all user requests, at no or negligible cost. A lock/queue would be great, if it were plug-n-play. – Magne Jul 07 '20 at 15:58
  • @OlegValter 5. End users should be completely spared the implementation efforts and details. For my own part, I am convinced it seems worth it to look into CF for my next project. So thank you for that! It looks like a solution using CF would have been even smoother than my solution. So if you are willing to share a simple hello-world type answer to this question, showing how easy it would be to parallelise a custom function using CF, that would be awesome, for future reference. A link to a blog post showing the how-to would be equally helpful, although the SO overlords might frown upon that. – Magne Jul 07 '20 at 16:03
  • 3. Well, yeah, cloud platforms are a bit littered with marketing bs... Things change for the better, thankfully - both GCP and Azure are now easier to work with (btw, as an alternative - the same could be done with Azure Functions). Re: sample - yep, sure, probably (cannot guarantee it today, though, so do keep track). Re: link - never have I posted a link-only answer, the gods can sleep well :) 4. The lock is sort of plug-n-play, see [`LockService`](https://developers.google.com/apps-script/reference/lock); as for a queue - there is Cloud Scheduler, but I am unsure if it is easy to set up, no exp here – Oleg Valter is with Ukraine Jul 07 '20 at 16:37