
Subsetting a table before extracting a JSON element

I need to subset a very large table on BigQuery. The column that I will be filtering (joining) on to achieve this subsetting is not a JSON array. However, I would like to extract a complementary column from a JSON array afterwards. No matter how I rearrange my query, it seems to process the full (i.e. non-subsetted) table whenever I include the extracted JSON element.

As an MWE, consider a query that I'm adapting/borrowing from @felipe-hoffa here:

SELECT id
FROM `githubarchive.day.20180830` 
WHERE type='PushEvent' 
AND id='8188163772'

This query processes 33.9 MB.

However, if I add an extracted column from the JSON array (which again, I'm not subsetting on):

SELECT id, JSON_EXTRACT_SCALAR(payload, '$.size') AS size
FROM `githubarchive.day.20180830` 
WHERE type='PushEvent' 
AND id='8188163772'

... then the processed-bytes figure jumps to 3.5 GB (i.e. it is effectively querying the whole table).

Any idea on how I could do this more efficiently and keep down per-query costs?
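For context, the workaround I've been sketching (with `my_dataset.push_events` as a hypothetical destination table) is to pay for one full scan up front and materialize just the two columns I need, so that later extractions only touch the small copy:

```sql
-- Sketch: one full scan to materialize only the needed columns,
-- then later queries touch just the small copy.
-- `my_dataset.push_events` is a hypothetical destination table.
CREATE TABLE `my_dataset.push_events` AS
SELECT id, payload
FROM `githubarchive.day.20180830`
WHERE type = 'PushEvent';

-- Subsequent extractions scan only the small table:
SELECT id, JSON_EXTRACT_SCALAR(payload, '$.size') AS size
FROM `my_dataset.push_events`
WHERE id = '8188163772';
```

But that only helps if I query the subset repeatedly, and it still costs one full scan to build.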

Grant
  • As soon as you touch that column `payload`, you pay for it, even though you use only a tiny piece of it! The only way around this is to consider partitioning / clustering ... – Mikhail Berlyant Apr 06 '21 at 21:40
  • Thanks @MikhailBerlyant, that was my fear! Do you mind moving your comment to an answer and I'll mark it as the accepted solution? – Grant Apr 06 '21 at 22:23

1 Answer


As soon as you touch that column `payload`, you pay for it, even though you use only a tiny piece of it! The only way around this is to consider partitioning / clustering ...
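A sketch of that idea, assuming a one-time copy into a hypothetical `my_dataset` dataset: clustering on `id` lets BigQuery prune blocks when you filter on it, though you still pay for whichever columns the surviving blocks contain.

```sql
-- Sketch: one-time copy clustered by the filter column.
-- `my_dataset.day_20180830_clustered` is a hypothetical table name.
CREATE TABLE `my_dataset.day_20180830_clustered`
CLUSTER BY id
AS
SELECT *
FROM `githubarchive.day.20180830`;

-- Filtering on the clustering column can now prune blocks,
-- so selecting payload scans far less than the full column:
SELECT id, JSON_EXTRACT_SCALAR(payload, '$.size') AS size
FROM `my_dataset.day_20180830_clustered`
WHERE type = 'PushEvent'
  AND id = '8188163772';
```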

Mikhail Berlyant