I have the following transaction
table:
I would like to calculate the total quantity purchased for each:
product
category
(i.e. total quantity of all products within the samecategory
)department
(i.e. total quantity of all products within the samedepartment
)
In addition, the above totals should be computed:
- per individual shopper
- per family/household (Sum of total quantities for all shoppers in the same family).
The output table should look like:
For the family, the totals are calculated once and then "copied" to each shopper in the same household.
To calculate the multiple totals for product
/category
/department
across the table, I'm using GROUPING SETS
as pointed out to me in an earlier question here. So I got the total_quantity_individual
right.
For total_quantity_family
, it would make sense to use OVER(PARTITION BY)
as pointed out here on a much simpler table.
However, I'm not sure how to combine both together. There's not much information out there on combining OVER(PARTITION BY)
with GROUPING SETS
.
My query looks like:
SELECT
family_id,
shopper_id,
CASE
WHEN GROUPING__ID = 6 THEN 'department'
WHEN GROUPING__ID = 5 THEN 'category'
WHEN GROUPING__ID = 3 THEN 'product'
END AS total_level_type,
CASE
WHEN GROUPING__ID = 6 THEN department
WHEN GROUPING__ID = 5 THEN category
WHEN GROUPING__ID = 3 THEN product
END AS id,
SUM(quantity) AS total_quantity_shopper
-- sum(sum(quantity)) OVER (PARTITION BY family_id, product) AS total_quantity_family
FROM
transaction
GROUP BY
family_id,
shopper_id,
product,
category,
department
GROUPING SETS (
(family_id, shopper_id, product),
(family_id, shopper_id, category),
(family_id, shopper_id, department)
)
ORDER BY
total_level_type;
If OVER(PARTITION BY)
does not work for my case, my other options might be:
- Group
transaction
byfamily_id
, then run GROUPING SETS on the result, then join back withtransaction
. - Maybe a trick using explode() and lateral view?
I really don't want to separate out the queries between the individual vs family versions for maintainability reasons.
Note: I'm using Spark SQL with Hive context if that helps.
Appreciate any help. Thanks!
Edit: This seems to work:
...
SUM(quantity) AS total_quantity_shopper,
CASE
WHEN GROUPING__ID = 6 THEN sum(sum(quantity)) OVER (PARTITION BY family_id, department)
WHEN GROUPING__ID = 5 THEN sum(sum(quantity)) OVER (PARTITION BY family_id, category)
WHEN GROUPING__ID = 3 THEN sum(sum(quantity)) OVER (PARTITION BY family_id, product)
END AS total_quantity_family
...