Q. what if we do an aggregation and have 10M results? unless we are applying the limits, etc. in bigquery -- the amount of data transferred would take a tremendous amount …
Let's clarify the challenge here:
So usually, you would run something like below in back-end and pull result up to visualization tool (front-end) for further manipulations like sorts, limits, pivoting, etc.
#standardSQL
SELECT
Studio,
Title,
TerritoryID,
Type,
SUM(Price) AS Price,
COUNT(1) AS Volume
FROM YourTable
GROUP BY Studio, Title, TerritoryID, Type
As you mentioned, such result in your case can easily produce 10M+ rows and you want to reduce size of it w/o affecting your ability to still present final data in your pivot/visualization in front-end
A. Recommendation / Solution
Below shows how to achieve this by applying sorts and limits on back-end (so result size is drastically reduced) w/o losing ability to do pivoting and still show totals, etc.
Let’s get to final query by starting with simplified one
Let’s assume, based on known criteria, that we know in advance which Studios, Titles, Territories and Types should be selected
In this case, below query will return desired data
#standardSQL
WITH Studios AS (
SELECT 'Fox'
UNION ALL SELECT 'Paramouont'
),
Titles AS (
SELECT 'Fox' AS Studio,'Best Laid Plans' AS Title
UNION ALL SELECT 'Fox','Homecoming'
UNION ALL SELECT 'Paramount','Titanic'
UNION ALL SELECT 'Paramount','Homecoming'
),
Territories AS (
SELECT 'US' AS TerritoryID
UNION ALL SELECT 'GB'
),
Totals AS (
SELECT
IFNULL(b.Studio,'Other') AS Studio,
IFNULL(b.Title,'Other') AS Title,
IFNULL(c.TerritoryID,'Other') AS TerritoryID,
Type,
ROUND(SUM(Price), 2) AS Price, COUNT(1) AS Volume
FROM yourTable AS a
LEFT JOIN Titles AS b ON a.Studio = b.Studio AND a.Title = b.Title
LEFT JOIN Territories AS c ON a.TerritoryID = c.TerritoryID
GROUP BY Studio, Title, TerritoryID, Type
)
SELECT * FROM Totals
ORDER BY Studio, Title, TerritoryID, Type
The output will be something as below
Studio Title TerritoryID Type Price Volume
Fox Best Laid Plans GB Movie 87.32 18
Fox Best Laid Plans GB TV Episode 50.17 23
Fox Best Laid Plans Other TV Episode 1131.0 2
Fox Best Laid Plans US Movie 120.82 18
Fox Best Laid Plans US TV Episode 53.76 24
Fox Homecoming GB TV Episode 60.22 28
Fox Homecoming Other TV Episode 2262.0 4
Fox Homecoming US TV Episode 128.45 58
Other Other GB Movie 142.71 29
Other Other GB TV Episode 84.8 40
Other Other Other Movie 3292.0 4
Other Other Other TV Episode 3282.0 16
Other Other US Movie 52.92 8
Other Other US TV Episode 233.05 101
Paramount Homecoming GB Movie 18.96 4
Paramount Homecoming US Movie 124.84 16
Paramount Titanic GB Movie 41.92 8
Paramount Titanic Other Movie 12.0 4
Paramount Titanic US Movie 139.84 16
You can easily feed it back to your UI to visualize it in whatever way you need
Now, instead of hard-coded values in all involved dimensions - let’s implement actual criteria(s) for each dimension.
So the only changes in below query (vs above skeleton query) are in following CTEs: Studios, Titles, and Territories
#standardSQL
WITH Studios AS (
SELECT DISTINCT Studio
FROM yourTable
ORDER BY Studio LIMIT 3
),
Titles AS (
SELECT Studio, Title
FROM (
SELECT Studio, Title, ROW_NUMBER() OVER(PARTITION BY Studio ORDER BY PRICE DESC) AS pos
FROM (SELECT Studio, Title, SUM(Price) AS Price FROM yourTable GROUP BY Studio, Title)
) WHERE pos <= 4
),
Territories AS (
SELECT TerritoryID FROM yourTable
WHERE Studio = 'Paramount' GROUP BY TerritoryID
ORDER BY COUNT(1) DESC LIMIT 2
),
Totals AS (
SELECT
IFNULL(b.Studio,'Other') AS Studio,
IFNULL(b.Title,'Other') AS Title,
IFNULL(c.TerritoryID,'Other') AS TerritoryID,
Type,
ROUND(SUM(Price), 2) AS Price, COUNT(1) AS Volume
FROM yourTable AS a
LEFT JOIN Titles AS b ON a.Studio = b.Studio AND a.Title = b.Title
LEFT JOIN Territories AS c ON a.TerritoryID = c.TerritoryID
GROUP BY Studio, Title, TerritoryID, Type
)
SELECT * FROM Totals
WHERE NOT 'Other' IN (TerritoryID)
ORDER BY Studio, TerritoryID DESC, Type, Price DESC, Title
The result here is:
Studio Title TerritoryID Type Price Volume
Fox Best Laid Plans US Movie 120.82 18
Fox Titanic US Movie 52.92 8
Fox 1:00 P.M. - 2:00 P.M. US TV Episode 187.25 81
Fox Homecoming US TV Episode 128.45 58
Fox Best Laid Plans US TV Episode 53.76 24
Fox Best Laid Plans GB Movie 87.32 18
Fox Titanic GB Movie 78.84 16
Fox 1:00 P.M. - 2:00 P.M. GB TV Episode 61.42 28
Fox Homecoming GB TV Episode 60.22 28
Fox Best Laid Plans GB TV Episode 50.17 23
Paramount Titanic US Movie 139.84 16
Paramount Homecoming US Movie 124.84 16
Paramount Titanic GB Movie 41.92 8
Paramount Homecoming GB Movie 18.96 4
Sony Best Laid Plans US TV Episode 22.9 10
Sony Homecoming US TV Episode 22.9 10
Sony Best Laid Plans GB Movie 63.87 13
Sony Homecoming GB TV Episode 18.81 9
Sony Best Laid Plans GB TV Episode 4.57 3
The point here is - while BigQuery is extremely efficient in analyzing billions of rows and extracting needed info, It is quite ineficient to use BigQuery to actually tailor result data to reflect how this result will actually be presented in presentation layer on client UI. Instead - you should just pass this data to UI and have your visualization code to handle it