
Subsetting a table before extracting a JSON element

I need to subset a very large table on BigQuery. The column that I will be filtering (joining) on to achieve this subsetting is not a JSON array. However, I would like to extract a complementary column from a JSON array afterwards. No matter how I rearrange my query, it seems to process the full (i.e. non-subsetted) table whenever I include the extracted JSON element.

As an MWE, consider a query that I'm adapting/borrowing from @felipe-hoffa here:

SELECT id
FROM `githubarchive.day.20180830` 
WHERE type='PushEvent' 
AND id='8188163772'

This query processes 33.9 MB.

However, if I add an extracted column from the JSON array (which again, I'm not subsetting on):

SELECT id, JSON_EXTRACT_SCALAR(payload, '$.size') AS size
FROM `githubarchive.day.20180830` 
WHERE type='PushEvent' 
AND id='8188163772'

... then the processed-bytes figure jumps to 3.5 GB (i.e. it is effectively querying the whole table).

Any idea on how I could do this more efficiently and keep down per-query costs?
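For context, the workaround I've been sketching (with `my_dataset.push_events` as a hypothetical destination table) is to pay for one full scan up front and materialize just the two columns I need, so that later extractions only touch the small copy:

```sql
-- Sketch: one full scan to materialize only the needed columns,
-- then later queries touch just the small copy.
-- `my_dataset.push_events` is a hypothetical destination table.
CREATE TABLE `my_dataset.push_events` AS
SELECT id, payload
FROM `githubarchive.day.20180830`
WHERE type = 'PushEvent';

-- Subsequent extractions scan only the small table:
SELECT id, JSON_EXTRACT_SCALAR(payload, '$.size') AS size
FROM `my_dataset.push_events`
WHERE id = '8188163772';
```

But that only helps if I query the subset repeatedly, and it still costs one full scan to build.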

Grant
  • As soon as you touch that column `payload`, you pay for it, even though you use only a tiny piece of it! The only way around this is to consider partitioning / clustering ... – Mikhail Berlyant Apr 06 '21 at 21:40
  • Thanks @MikhailBerlyant, that was my fear! Do you mind moving your comment to an answer and I'll mark it as the accepted solution? – Grant Apr 06 '21 at 22:23

1 Answer


As soon as you touch that column `payload`, you pay for it, even though you use only a tiny piece of it! The only way around this is to consider partitioning / clustering ...
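A sketch of that idea, assuming a one-time copy into a hypothetical `my_dataset` dataset: clustering on `id` lets BigQuery prune blocks when you filter on it, though you still pay for whichever columns the surviving blocks contain.

```sql
-- Sketch: one-time copy clustered by the filter column.
-- `my_dataset.day_20180830_clustered` is a hypothetical table name.
CREATE TABLE `my_dataset.day_20180830_clustered`
CLUSTER BY id
AS
SELECT *
FROM `githubarchive.day.20180830`;

-- Filtering on the clustering column can now prune blocks,
-- so selecting payload scans far less than the full column:
SELECT id, JSON_EXTRACT_SCALAR(payload, '$.size') AS size
FROM `my_dataset.day_20180830_clustered`
WHERE type = 'PushEvent'
  AND id = '8188163772';
```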

Mikhail Berlyant