I'm writing a background job to automatically process A/B test data in BigQuery, and I'm finding that I'm hitting "Resources exceeded during query execution" when doing large GROUP EACH BY statements. I saw from Resources Exceeded during query execution that reducing the number of groups can make queries succeed, so I split up my data into smaller pieces, but I'm still hitting errors (although less frequently). It would be nice to get a better intuition about what actually causes this error. In particular:
- Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
- What's the right way to approximate the memory usage and the total memory I have available? Am I correct in assuming each shard tracks about 1/n of the groups and keeps the group key and all aggregates for each group, or is there another way that I should be thinking about it?
- How is the number of shards determined? In particular, do I get fewer shards/resources if I'm querying over a smaller dataset?
The problematic query looks like this (in practice, it's used as a subquery, and the outer query aggregates the results):
SELECT
alternative,
snapshot_time,
SUM(column_1),
...
SUM(column_139)
FROM
my_table
CROSS JOIN
[table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time
(Here's an example failed job: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4 )
I realize this query may be asking for trouble, but in this case, the table is only 22MB and the query results in under a million groups and it's still failing with "resources exceeded". Reducing the number of timestamps to process at once fixes the error, but I'm worried that I'll eventually hit a data scale large enough that this approach as a whole will stop working.