0

This is a follow-up question to Multi-level pivot in Google BigQuery, in which I wanted to know if it was possible to construct a nested pivot table in Google BigQuery using a single query. It is, and so in this follow-up question, I'd like to explore the general case.

Here is an example of the data that I'm using (which is also included in this shared Google Sheet) :

enter image description here

Now, I would like to build a pivot table that has the following properties:

  • Nested levels at both the row and col level (the previous question only had nested-cols)
  • Sub-totals within both the rows and cols (the previous only had a grand total)
  • Multiple metrics (the previous only had a single metric)
  • Multiple sorts -- by both deep metrics and by alphabetical (the previous did not have any sort conditions)
  • Limits (the previous did not have any limits at all)

Here is the pivot built in Google Sheets --

enter image description here

The conceptual SQL statement here would be:

SELECT
    SUM(price),
    COUNT(price) 
BROKEN DOWN BY
    Studio (row),
    Title (row)
    Territory ID (col),
    Type (col)
SORTED/LIMITED BY
    Studio ==> A-Z, LIMIT 3,
    Title ==> SUM(price) in GRAND TOTAL DESC, LIMIT 4,
    Territory ID ==> COUNT(price) in Paramount TOTAL, LIMIT 2
    Type ==> A-Z, NO LIMIT

I'm not sure how to conceptually show the Subtotals in, but we should be able to specify those for each of the broken-down-by fields.

Is it possible to do the above in a single SQL statement in Google BigQuery? What would be the steps to generate it?

Community
  • 1
  • 1
David542
  • 104,438
  • 178
  • 489
  • 842
  • to better understand question - can you please explain why you want do all this in BigQuery rather than in relevant pivot tool as you already have in your demo Google Sheet for example? – Mikhail Berlyant Feb 23 '17 at 06:03
  • @MikhailBerlyant well, we're basically bringing our data into Excel or Google Sheets to test it out on small queries and help us better visualize it. The actual data that we have is too big for Google Sheets or Excel or PowerPivot. So we need to do this in-database. – David542 Feb 23 '17 at 23:30
  • so mostly you are trying to minimize volume of the data that you bring to visualization tool - right? and visualization itself still can be done in that tool. Am I correct? if this is correct - this would totally changed the way your question look to me right now – Mikhail Berlyant Feb 24 '17 at 00:39
  • @MikhailBerlyant that's correct, we're trying to pivot the data to be able to export it or visualize it in another tool. – David542 Feb 24 '17 at 01:28
  • i am still lost - why you need to pivot data before export to tool? pivot is a great feature of the tool like Excel or Google Sheet or PowerPivot! You do not need do this in BigQuery at all. I would understand that you want to minimize volume of the data to play with in your tool - but this does not require pivoting!! if pivoting/visualization is out of scope of your question - your question becomes reasonable and practical and still challenging so there is still a room for thinking and helping you. Otherwise i don't think it is practical – Mikhail Berlyant Feb 24 '17 at 01:33
  • @MikhailBerlyant take a look at this question: http://stackoverflow.com/questions/42381948/adding-a-pivot-to-a-rollup. There's a screenshot of the interface there. Basically BigQuery is the backend for being able to create Excel-like charts and pivots on the front-end. – David542 Feb 24 '17 at 01:44
  • ok, I see. so it means that you do really try to use bigquery to build pivot here and than bring it to front-end. in my mind this is a dead-end. you should separate data retrieval and move pivot/visualization part to front end. Meantime, theoretically - I think it is possible to do entirely in bigquery but just too much of a headache – Mikhail Berlyant Feb 24 '17 at 01:49
  • @MikhailBerlyant how could that be done? For example, what if we do an aggregation and have 10M results, unless we are applying the limits, etc. in bigquery -- the amount of data transferred would take a tremendous amount of time. – David542 Feb 24 '17 at 01:59
  • you ARE BACK to my point! reducing size and pivoting ARE TWO different areas here! please read my comments again! i would see reasonable to set goal of reducing size to ONLY data that needed to be further (on front-end) visualized/pivoted/etc. but you insist on full pivoting which makes it look no good to me – Mikhail Berlyant Feb 24 '17 at 02:03
  • @MikhailBerlyant oh, I see -- got it. – David542 Feb 24 '17 at 02:41
  • @MikhailBerlyant thanks for your help on this -- could I send you an email to discuss and perhaps see if you'd be interested in helping us on this implementation? My email is deemarklit@gmail.com. Thanks – David542 Feb 24 '17 at 19:51

1 Answers1

2

Q. what if we do an aggregation and have 10M results? unless we are applying the limits, etc. in bigquery -- the amount of data transferred would take a tremendous amount …


Let's clarify the challenge here:

So usually, you would run something like below in back-end and pull result up to visualization tool (front-end) for further manipulations like sorts, limits, pivoting, etc.

#standardSQL
SELECT
  Studio, 
  Title, 
  TerritoryID,
  Type, 
  SUM(Price) AS Price, 
  COUNT(1) AS Volume
FROM YourTable  
GROUP BY Studio, Title, TerritoryID, Type   

As you mentioned, such result in your case can easily produce 10M+ rows and you want to reduce size of it w/o affecting your ability to still present final data in your pivot/visualization in front-end


A. Recommendation / Solution

Below shows how to achieve this by applying sorts and limits on back-end (so result size is drastically reduced) w/o losing ability to do pivoting and still show totals, etc.

Let’s get to final query by starting with simplified one

  • Initial query (skeleton)

Let’s assume, based on known criteria, that we know in advance which Studios, Titles, Territories and Types should be selected
In this case, below query will return desired data

#standardSQL
WITH Studios AS (
  SELECT 'Fox' 
  UNION ALL SELECT 'Paramouont' 
),
Titles AS (
  SELECT 'Fox' AS Studio,'Best Laid Plans' AS Title
  UNION ALL SELECT 'Fox','Homecoming'
  UNION ALL SELECT 'Paramount','Titanic'
  UNION ALL SELECT 'Paramount','Homecoming'
),
Territories AS (
  SELECT 'US' AS TerritoryID
  UNION ALL SELECT 'GB'
),
Totals AS (
  SELECT 
    IFNULL(b.Studio,'Other') AS Studio, 
    IFNULL(b.Title,'Other') AS Title, 
    IFNULL(c.TerritoryID,'Other') AS TerritoryID, 
    Type,
    ROUND(SUM(Price), 2) AS Price, COUNT(1) AS Volume
  FROM yourTable AS a 
  LEFT JOIN Titles AS b ON a.Studio = b.Studio AND a.Title = b.Title
  LEFT JOIN Territories AS c ON a.TerritoryID = c.TerritoryID
  GROUP BY Studio, Title, TerritoryID, Type
)
SELECT * FROM Totals
ORDER BY Studio, Title, TerritoryID, Type

The output will be something as below

Studio      Title           TerritoryID Type        Price    Volume  
Fox         Best Laid Plans GB          Movie         87.32    18    
Fox         Best Laid Plans GB          TV Episode    50.17    23    
Fox         Best Laid Plans Other       TV Episode  1131.0      2    
Fox         Best Laid Plans US          Movie        120.82    18    
Fox         Best Laid Plans US          TV Episode    53.76    24    
Fox         Homecoming      GB          TV Episode    60.22    28    
Fox         Homecoming      Other       TV Episode  2262.0      4    
Fox         Homecoming      US          TV Episode   128.45    58    
Other       Other           GB          Movie        142.71    29    
Other       Other           GB          TV Episode    84.8     40    
Other       Other           Other       Movie       3292.0      4    
Other       Other           Other       TV Episode  3282.0     16    
Other       Other           US          Movie         52.92     8    
Other       Other           US          TV Episode   233.05   101    
Paramount   Homecoming      GB          Movie         18.96     4    
Paramount   Homecoming      US          Movie        124.84    16    
Paramount   Titanic         GB          Movie         41.92     8    
Paramount   Titanic         Other       Movie         12.0      4    
Paramount   Titanic         US          Movie        139.84    16   

You can easily feed it back to your UI to visualize it in whatever way you need

  • “Final” query

Now, instead of hard-coded values in all involved dimensions - let’s implement actual criteria(s) for each dimension.
So the only changes in below query (vs above skeleton query) are in following CTEs: Studios, Titles, and Territories

#standardSQL
WITH Studios AS (
  SELECT DISTINCT Studio 
  FROM yourTable 
  ORDER BY Studio LIMIT 3
),
Titles AS (
  SELECT Studio, Title 
  FROM (
    SELECT Studio, Title, ROW_NUMBER() OVER(PARTITION BY Studio ORDER BY PRICE DESC) AS pos
    FROM (SELECT Studio, Title, SUM(Price) AS Price FROM yourTable GROUP BY Studio, Title)
  ) WHERE pos <= 4
),
Territories AS (
  SELECT TerritoryID FROM yourTable  
  WHERE Studio = 'Paramount' GROUP BY TerritoryID
  ORDER BY COUNT(1) DESC LIMIT 2
),
Totals AS (
  SELECT 
    IFNULL(b.Studio,'Other') AS Studio, 
    IFNULL(b.Title,'Other') AS Title, 
    IFNULL(c.TerritoryID,'Other') AS TerritoryID, 
    Type,
    ROUND(SUM(Price), 2) AS Price, COUNT(1) AS Volume
  FROM yourTable AS a 
  LEFT JOIN Titles AS b ON a.Studio = b.Studio AND a.Title = b.Title
  LEFT JOIN Territories AS c ON a.TerritoryID = c.TerritoryID
  GROUP BY Studio, Title, TerritoryID, Type
)
SELECT * FROM Totals
WHERE NOT 'Other' IN (TerritoryID)
ORDER BY Studio, TerritoryID DESC, Type, Price DESC, Title

The result here is:

Studio      Title           TerritoryID Type        Price  Volume    
Fox         Best Laid Plans         US  Movie       120.82  18   
Fox         Titanic                 US  Movie        52.92   8   
Fox         1:00 P.M. - 2:00 P.M.   US  TV Episode  187.25  81   
Fox         Homecoming              US  TV Episode  128.45  58   
Fox         Best Laid Plans         US  TV Episode   53.76  24   
Fox         Best Laid Plans         GB  Movie        87.32  18   
Fox         Titanic                 GB  Movie        78.84  16   
Fox         1:00 P.M. - 2:00 P.M.   GB  TV Episode   61.42  28   
Fox         Homecoming              GB  TV Episode   60.22  28   
Fox         Best Laid Plans         GB  TV Episode   50.17  23   
Paramount   Titanic                 US  Movie       139.84  16   
Paramount   Homecoming              US  Movie       124.84  16   
Paramount   Titanic                 GB  Movie        41.92   8   
Paramount   Homecoming              GB  Movie        18.96   4   
Sony        Best Laid Plans         US  TV Episode   22.9   10   
Sony        Homecoming              US  TV Episode   22.9   10   
Sony        Best Laid Plans         GB  Movie        63.87  13   
Sony        Homecoming              GB  TV Episode   18.81   9   
Sony        Best Laid Plans         GB  TV Episode    4.57   3       

The point here is - while BigQuery is extremely efficient in analyzing billions of rows and extracting needed info, It is quite ineficient to use BigQuery to actually tailor result data to reflect how this result will actually be presented in presentation layer on client UI. Instead - you should just pass this data to UI and have your visualization code to handle it

Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230