
I am currently attempting to run the following SQL script in BigQuery (with the goal of saving the resulting table out), but I am getting a capacity-related processing error when I start the query without a row return size limit. The error is the common one: "Error: Resources exceeded during query execution."

SELECT date,
       CONCAT(fullvisitorid, STRING(visitid)) AS unique_visit_id,
       visitid,
       visitnumber,
       fullvisitorid,
       totals.pageviews,
       totals.bounces,
       hits.page.pagepath,
       hits.page.pagetitle,
       device.devicecategory,
       device.browser,
       device.browserversion,
       hits.customvariables.index,
       hits.customvariables.customvarname,
       hits.customvariables.customvarvalue,
       hits.time
FROM   (FLATTEN([XXXXXXXX.ga_sessions_20140711], hits.time))
WHERE  hits.customvariables.index = 4
ORDER  BY unique_visit_id DESC,
          hits.time ASC

The job ID that was provided for the job is ua-web-analytics:job_60fxNnmo9gZ23kaji50v3qH9dXs. I have read the other posts on the topic of these errors, such as this post, which focuses on the resource errors observed while completing a join. I suspect that the issue right now is with the use of FLATTEN, and I am working through some different approaches. That said, I am concerned because, in the future, this query may be run over 30 or 60 days of data together (versus just the single day that I am prototyping on right now), which will dramatically increase the data size to between 500 GB and 1 TB.

The goal of the above query was to generate a table which I could save out and then operate on. Unfortunately, doing this in an ad hoc manner seems somewhat problematic. Has anyone else encountered resource constraints when using a similar SQL query? For context, the table being queried is about 17.2 GB in size, with just over a million rows.


2 Answers


As @Pentium10 mentioned, setting "allow large results" will let you return the larger result set from the flattened query. Usually the signal that you should use "allow large results" is that you see a "result too large" error.
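For reference, here is one way to turn this on from the command line with the bq tool. This is just a sketch: the destination dataset and table names are hypothetical, and --allow_large_results has to be paired with a destination table.

# Hypothetical destination table; --allow_large_results requires one.
bq query --allow_large_results --destination_table=mydataset.ga_hits_20140711 \
    "SELECT date, fullvisitorid, hits.time
     FROM (FLATTEN([XXXXXXXX.ga_sessions_20140711], hits.time))
     WHERE hits.customvariables.index = 4"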

However, there is another part of your query that is unparallelizable: the ORDER BY operation. Is this required? We've found that most of the time when ORDER BY is used on large tables, what people really want is an ORDER BY ... LIMIT (which can be done efficiently and in parallel). Or they are just adding the ORDER BY because it makes it easier to eyeball the results. If you can drop the ORDER BY, it will likely make your query faster and scale better as the data size increases.
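For example, if you mainly need to eyeball a sample of ordered rows, adding a LIMIT keeps the sort parallelizable. A sketch against the query in the question:

SELECT date,
       CONCAT(fullvisitorid, STRING(visitid)) AS unique_visit_id,
       hits.time
FROM   (FLATTEN([XXXXXXXX.ga_sessions_20140711], hits.time))
WHERE  hits.customvariables.index = 4
ORDER  BY unique_visit_id DESC, hits.time ASC
LIMIT  1000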

Jordan Tigani
  • Interesting. I think the ORDER BY is required. What I am working to do is generate a list, for every unique visit, showing the specific path at each hit. There is a large CASE statement that specifies that "if the path is XXXX prior to the error message URL YYY", then do something. My thought was to put the paths in order for each visit using hits.time, and then use LAG to look back at the path on the row previous to the error message. Did you have any suggestions, Jordan, on approaching this? If I can get around this without using ORDER BY, then perhaps we will be okay! – Nathaniel Payne Jul 14 '14 at 18:09
    As you are going to use LAG, instead of ORDER BY at the end, you could have LAG(x) OVER(PARTITION BY unique_visit_id ORDER BY hits.time). Once you start partitioning, the query becomes parallelizable. – Felipe Hoffa Jul 14 '14 at 20:58
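To make that concrete, here is a sketch of Felipe's suggestion applied to the query in the question (legacy BigQuery SQL, column list abbreviated; the pagepath, hit_time, and previous_pagepath aliases are illustrative):

SELECT unique_visit_id,
       pagepath,
       hit_time,
       LAG(pagepath, 1) OVER (PARTITION BY unique_visit_id
                              ORDER BY hit_time) AS previous_pagepath
FROM (SELECT CONCAT(fullvisitorid, STRING(visitid)) AS unique_visit_id,
             hits.page.pagepath AS pagepath,
             hits.time AS hit_time
      FROM (FLATTEN([XXXXXXXX.ga_sessions_20140711], hits.time))
      WHERE hits.customvariables.index = 4)

Because the sort happens within each unique_visit_id partition rather than over the whole result set, BigQuery can process the partitions in parallel, and previous_pagepath holds the page seen immediately before each hit.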

I've been through exactly the same problem, and I really needed the ORDER BY, so Jordan Tigani's answer wasn't an option. Here is the workaround I used:

  1. Run the query without the ORDER BY and save the results in a dataset table (see the command sketch after this list).

  2. Export the content of that table to a bucket in GCS using a wildcard URI (gs://BUCKETNAME/FILENAME*.csv).

  3. Download the files to a folder on your machine.

  4. Install XAMPP (you may get a UAC warning during installation) and adjust some settings afterwards.

  5. Start Apache and MySQL in your XAMPP control panel.

  6. Install HeidiSQL and establish the connection with your MySQL server (installed with XAMPP).

  7. Create a database and a table with the appropriate fields.

  8. Go to Tools > Import CSV file, configure accordingly and import.

  9. Once all the data is imported, run the ORDER BY query and export the table.
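For steps 1 and 2, a command-line sketch with the bq tool might look like the following (the dataset and table names are hypothetical, and BUCKETNAME/FILENAME are the same placeholders as in step 2):

# Step 1: save the un-ordered query results to a table.
bq query --allow_large_results --destination_table=mydataset.unsorted_results "SELECT ..."

# Step 2: export that table to GCS as sharded CSV files via a wildcard URI.
bq extract mydataset.unsorted_results gs://BUCKETNAME/FILENAME*.csv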

Tiago Martins Peres