Apparently there is a memory leak on BigQuery's UDF. We run a simple UDF over a small table (3000 rows, 5MB) and it fails. If we run the same UDF over the first half of the table concatenated with the second half of the table (in the same query), then it works! Namely:
SELECT blah myUDF(SELECT id,data FROM table)
fails.
SELECT blah myUDF(SELECT id, data FROM table ORDER BY id LIMIT 1500),myUDF(SELECT id, data FROM table ORDER BY id DESC LIMIT 1500)
succeeds.
The question is: how do we work around this issue? Is there a way to dynamically split a table in multiple parts, each of equal size and of predefined number of rows? Say 1000 rows at a time? (the sample table has 3000 rows, but we want this to succeed in larger tables, and if we split a 6000 row table in half, the UDF will be failing again on each half).
In any solution, it is important to (a) NOT use ORDER BY, because it has a 65000 row limitation; (b) use a single combined query (otherwise the solution may be too slow, plus every combined table is charged at a minimum of 10MB, so if we have to split a 1,000,000 row table into 1,000 rows at a time we will automatically be charged for 10 GB. Times 1,000 tables = 10TB. This stuff adds up quickly)
Any ideas?