0

I'm working in BigQuery on Google Analytics data. At various points in developing the query I get the error: "Resources exceeded". I want to further my understanding of what's happening. I've successfully worked around the problem, but only via trial and error.

When I use the explain tool it seems to be the 'compute' part of any query or sub-query that looks to have exceeded resources.

Here's an example of a standard SQL query that succeeds/fails depending on whether certain parts are left in:

  SELECT
    fullVisitorId,
    visitId,
    h.type AS type,
    h.hitNumber AS hitNumber,
    h.eventInfo.eventAction AS action,
    LOWER(h.eventInfo.eventCategory) AS category,
    h.page.pagePath AS page,
    h.page.pageTitle AS landingTitle,
    h.page.searchKeyword AS searchTerm,
    LEAD(h.page.pagePath) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) AS landingPage,
    SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] AS clickTitle,
    CASE WHEN LEAD(h.page.pageTitle) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) = SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] THEN true ELSE false END AS searchClick     
 FROM `project.dataset.ga_sessions_*` AS main, UNNEST(hits) AS h 
  WHERE _TABLE_SUFFIX BETWEEN '20170401' AND '20170430'
AND (
  (
    h.eventInfo.eventAction = 'click' AND LOWER(h.eventInfo.eventCategory) LIKE '/search%'
  )
    OR type = 'PAGE'
)
ORDER BY
  fullVisitorId ASC, visitId ASC, h.hitNumber ASC

When removing any one of these sets of elements the query runs:

ORDER BY
  fullVisitorId ASC, visitId ASC, h.hitNumber ASC

Or:

LEAD(h.page.pagePath) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) AS landingPage,
SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] AS clickTitle,
CASE WHEN LEAD(h.page.pageTitle) OVER (PARTITION BY fullVisitorId, visitId ORDER BY h.hitNumber ASC) = SPLIT(h.eventInfo.eventLabel, ':')[OFFSET(0)] THEN true ELSE false END AS searchClick

Or:

When running on a single date partition the entire query runs.

I would describe my current level of understanding as superficial, I know little of the inner workings of GBQ and how it allocates/permits compute resources. I do know that it performs calculations on separate machines where possible. I've heard these described as shards before.

What do I need to know about GBQ compute resources in order to understand why the above will work/not work?

N.B: I only have Tier 1 access, but that doesn't mean I can't gain increased access if I can justify a need. Obviously I don't want to do this with current level of understanding.

goose
  • 2,502
  • 6
  • 42
  • 69

1 Answers1

3

I think the only thing that should be causing a problem in your query is the ORDER BY operation. As you can see in this answer from Jordan, this operation is not parallelizable. You can also check the docs for some ideas of what causes the Resources Exceeded error.

The rest of the query seems to be fine though. I tested your query against our data and it processed almost 300Gb in 20s:

enter image description here

If you still get the error then maybe you are querying quite a high amount of data. This being the case, you could try breaking the query into smaller dates range, querying for less columns, adding some WHERE conditions to filter out some rows, changing tier and so on.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Willian Fuks
  • 11,259
  • 10
  • 50
  • 74
  • Thanks for your response Will. So am I right in thinking that it's not possible to use ORDER BY on data sets that are > ~6.5GB? That seems to be the point where it breaks. – goose Jun 16 '17 at 15:11
  • Not sure what the threshold is but usually if your result set have million of rows then probably `ORDER BY` won't work. What I found in my day to day work is that you actually never really need to order big datasets. Usually most tasks you are just fine by limiting the result and then ordering it (hopefully this is the case for you). – Willian Fuks Jun 16 '17 at 15:20
  • Yes that is the case thankfully. The order by statements got left in from when the query (not that shown above) was being worked up. I'm still interested to know more about how GBQ parrallelises queries. If you know any good resources pls feel free to share. – goose Jun 16 '17 at 15:24
  • Anything you'd like to know more about of a Google product you can look at https://research.google.com/ and search there. For BigQuery: https://research.google.com/search.html?q=bigquery. Also, they have an amazing blog: https://cloud.google.com/blog/big-data/. On top of that, GCP documentation is really good: https://cloud.google.com/bigquery/docs/ – Willian Fuks Jun 16 '17 at 15:29
  • Thanks again Will - I particularly like the blog titles listed here: https://cloud.google.com/blog/big-data/ – goose Jun 19 '17 at 07:48