
Apparently there is a memory leak in BigQuery's UDF execution. We run a simple UDF over a small table (3,000 rows, 5 MB) and it fails. If we run the same UDF over the first half of the table concatenated with the second half of the table (in the same query), then it works! Namely:
SELECT blah FROM myUDF(SELECT id, data FROM table)
fails.
SELECT blah FROM myUDF(SELECT id, data FROM table ORDER BY id LIMIT 1500), myUDF(SELECT id, data FROM table ORDER BY id DESC LIMIT 1500)
succeeds.

The question is: how do we work around this issue? Is there a way to dynamically split a table into multiple parts, each with the same, predefined number of rows? Say 1,000 rows at a time? (The sample table has 3,000 rows, but we want this to work on larger tables; if we split a 6,000-row table in half, the UDF will fail again on each half.)

In any solution, it is important to (a) NOT use ORDER BY, because it has a 65,000-row limitation, and (b) use a single combined query. Otherwise the solution may be too slow, and every table referenced is charged at a minimum of 10 MB, so if we have to split a 1,000,000-row table into 1,000-row chunks we will automatically be charged for 10 GB. Times 1,000 tables, that's 10 TB. This stuff adds up quickly.
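To illustrate the flavor of split we have in mind, here is a rough, untested sketch that uses legacy SQL's HASH() to bucket rows without ORDER BY (the modulus 3 is just a placeholder for the number of chunks, and the buckets come out only roughly equal in size rather than exactly 1,000 rows each):

SELECT blah FROM
  myUDF(SELECT id, data FROM table WHERE ABS(HASH(id)) % 3 = 0),
  myUDF(SELECT id, data FROM table WHERE ABS(HASH(id)) % 3 = 1),
  myUDF(SELECT id, data FROM table WHERE ABS(HASH(id)) % 3 = 2)

This keeps everything in a single combined query and avoids ORDER BY, but it does not guarantee a fixed number of rows per chunk, which is why we are looking for something better.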
Any ideas?

user3688176
  • I'm digging into root causes for JS OOM as we speak, stay tuned! Just out of curiosity, why do you think BigQuery has a 65,000-row limit on ORDER BY? Here is a query that orders almost 19 million rows: `SELECT [by] FROM [bigquery-public-data:hacker_news.full_201510] order by 1`. – thomaspark Mar 22 '16 at 17:19
  • Do you have a BigQuery job id you can share for your failing query? I will add it to my repro list. – thomaspark Mar 22 '16 at 17:23
  • We found a (prob temporary) fix, so it isn't failing any more. Here is one that did fail, but be aware that we have replaced the UDF with one that works, and the input table is no longer there (not needed after the computation succeeded): academic-diode-113417:bquijob_193cc0f5_1539c3a2c91 – user3688176 Mar 23 '16 at 22:09
  • On the ORDER BY: we had a failing large query with several subqueries over multiple tables, with one subquery having an ORDER BY and several having GROUP BY. We first used "EACH" on every join, group by and order. I can't find the reference right now, but in my research on why it was failing I read somewhere that ordering may cause the job to be handled by a single worker and thus fail at 65k records. Sure enough, when we removed the order by clause the query worked fine. Before you suggest that "EACH" should not be used, we tried that on a simple and small query and it failed miserably – user3688176 Mar 23 '16 at 22:25
  • Hm, ORDER BY works for arbitrary numbers of rows, as long as the query can fit underneath the "large results" threshold. Once the "use large results" box is checked, then we can't apply ORDER BY. – thomaspark Mar 24 '16 at 21:42
  • I pulled your job with academic-diode-113417:bquijob_193cc0f5_1539c3a2c91, and it looks like the table you were referencing in that job no longer exists. I have a possible fix for the OOM issue. Could you upload a version of the table for me to validate against? – thomaspark Mar 24 '16 at 21:44
  • I recreated that table, so you can try again. But be aware that we have modified the UDF and it now should work as is. (the modification was to reduce the depth of the if/then blocks). On the Order By: we have not tried running our queries without the "use large results" flag, as that also tends to fail. Are there easy, hard and fast rules for us to decide when a query (esp one with multiple subqueries) will require the "use large results" flag? – user3688176 Mar 25 '16 at 13:28
  • Re: "allow large results", this is based on the size of the result set that's returned. We allow up to 128M of **compressed** results to be returned without that flag. Unfortunately it's very difficult for end users to calculate how much **uncompressed** data that equates to, so the best answer I can give is to try it without the flag and see if the query fails :( – thomaspark Mar 25 '16 at 17:37
  • Quick update - all of the other queries I've been able to rerun that were failing with similar root causes are now running happily with my update, which just tunes some heap parameters in our V8. We will push this out ASAP next week; I'll update with a new answer when the push is done. I'm happy to try running your query if I can get a job id that corresponds to tables that currently exist :) – thomaspark Mar 25 '16 at 21:12
  • Could you please delete your comment that mentions the table name? We'd rather not have that floating around. – user3688176 Mar 28 '16 at 14:24
  • On "allow large results": does the 128M limit apply to results of subqueries, or to the end result? For example, in a query SELECT ... FROM (SELECT ... FROM) if the subquery returns 1G but the outside SELECT returns 1M, is the flag required? – user3688176 Mar 28 '16 at 14:25
  • recreated the table with the same name for you to test – user3688176 Mar 28 '16 at 14:27
  • Comment deleted! Re: "allow large results" - this just governs the final output size. If you do an enormous SELECT .. ORDER BY ... and then LIMIT *n* in an outer SELECT (sketched just after these comments), that will work as long as (1) the *n* rows fit under the "large results" threshold, and (2) we're able to do the ORDER BY on the backend without exceeding available memory. – thomaspark Mar 28 '16 at 19:25
  • Shoot, I forgot to mention, I don't have access to your code in Google Cloud Storage. Could you please retry the version of your code that was failing? – thomaspark Mar 28 '16 at 20:55
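For reference, a minimal sketch of the "ORDER BY inside, LIMIT outside" pattern described in the comment above, written in legacy SQL with made-up table and column names:

SELECT id, data
FROM (SELECT id, data FROM [mydataset.big_table] ORDER BY id DESC)
LIMIT 1000

Per that comment, this works as long as the 1,000 result rows fit under the "large results" threshold and the backend can perform the ORDER BY without exceeding available memory.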

1 Answer


This issue was related to limits we had on the size of the UDF code. It looks like V8's optimize+recompile pass of the UDF code generates a data segment bigger than our limits, but this only happens when the UDF runs over a "sufficient" number of rows. I'm meeting with the V8 team this week to dig into the details further.

In the meantime, we've rolled out a fix to update the max data segment size. I've verified that this fixes several other queries that were failing for the same reason.

Could you please retry your query? I'm afraid I can't easily get to your code resources in GCS.

thomaspark
  • FYI, there is a possibility you'll run into an "internal error" due to an issue being discussed on http://stackoverflow.com/a/36275462/5265394. The fix for that should be deployed by EOW. – thomaspark Mar 29 '16 at 16:56
  • Thanks Thomas. Haven't revisited this yet as we had refactored the code and we worked around the issue (and now we're swamped with other stuff), but will next week. I'm sure we'll want to run more UDFs in the future though, so it's important to keep fixing the V8 engine! – user3688176 Apr 01 '16 at 13:43
  • Sounds good, thanks! For the record, this wasn't an issue in V8 per se, the problem was just in BigQuery's usage of V8 - specifically, some of the configuration restrictions we were applying to it. – thomaspark Apr 01 '16 at 23:45
  • Update: we ran our js code on a MUCH larger input (about 100x) and it still worked. So apparently whatever you changed fixed it! And it appears that the V8 engine is running on all cylinders now, it's much faster. Thanks! – user3688176 Apr 06 '16 at 18:58
  • Super! Thanks for the followup, glad to hear it's performing much better now also :) – thomaspark Apr 11 '16 at 16:45
  • The beast is resurfacing... :-( Did you change anything on your end on or around April 24th 2016? – user3688176 Apr 26 '16 at 20:19
  • @user3688176 - no, I'm afraid we didn't. We did a deployment several days earlier. Do you have a job id I can look into? – thomaspark May 02 '16 at 20:20
  • It turns out we added some data that was generating large inputs per row, which is what caused the problem. Worked around it for now – user3688176 May 04 '16 at 20:33
  • Thanks for the followup! – thomaspark May 09 '16 at 23:39