1

I have the following transaction table:

enter image description here

I would like to calculate the total quantity purchased for each:

  • product
  • category (i.e. total quantity of all products within the same category)
  • department (i.e. total quantity of all products within the same department)

In addition, the above totals should be computed:

  1. per individual shopper
  2. per family/household (Sum of total quantities for all shoppers in the same family).

The output table should look like:

enter image description here

For the family, the totals are calculated once and then "copied" to each shopper in the same household.

To calculate the multiple totals for product/category/department across the table, I'm using GROUPING SETS as pointed out to me in an earlier question here. So I got the total_quantity_individual right.

For total_quantity_family, it would make sense to use OVER(PARTITION BY) as pointed out here on a much simpler table.

However, I'm not sure how to combine both together. There's not much information out there on combining OVER(PARTITION BY) with GROUPING SETS.

My query looks like:

SELECT
    family_id,
    shopper_id,
    CASE
        WHEN GROUPING__ID = 6 THEN 'department'
        WHEN GROUPING__ID = 5 THEN 'category'
        WHEN GROUPING__ID = 3 THEN 'product'
    END AS total_level_type,
    CASE
        WHEN GROUPING__ID = 6 THEN department
        WHEN GROUPING__ID = 5 THEN category
        WHEN GROUPING__ID = 3 THEN product
    END AS id,
    SUM(quantity) AS total_quantity_shopper
    -- sum(sum(quantity)) OVER (PARTITION BY family_id, product) AS total_quantity_family
FROM
    transaction
GROUP BY
    family_id, 
    shopper_id,
    product,
    category,
    department
    GROUPING SETS (
        (family_id, shopper_id, product),
        (family_id, shopper_id, category),
        (family_id, shopper_id, department)
    )
ORDER BY
  total_level_type;

If OVER(PARTITION BY) does not work for my case, my other options might be:

  1. Group transaction by family_id, then run GROUPING SETS on the result, then join back with transaction.
  2. Maybe a trick using explode() and lateral view?

I really don't want to separate out the queries between the individual vs family versions for maintainability reasons.

Note: I'm using Spark SQL with Hive context if that helps.

Appreciate any help. Thanks!

Edit: This seems to work:

...
SUM(quantity) AS total_quantity_shopper,
CASE        
    WHEN GROUPING__ID = 6 THEN sum(sum(quantity)) OVER (PARTITION BY family_id, department)
    WHEN GROUPING__ID = 5 THEN sum(sum(quantity)) OVER (PARTITION BY family_id, category)
    WHEN GROUPING__ID = 3 THEN sum(sum(quantity)) OVER (PARTITION BY family_id, product)
END AS total_quantity_family
...
alhazen
  • 1,907
  • 3
  • 22
  • 43

1 Answers1

1

Use multiple sum() over() with different partition by clause:

select
  family_id,
  shopper_id,
  total_level_type,
  id,
  total_quantity_individual,
  total_quantity_family
from
(
  select 
      family_id, 
      shopper_id,
      array(
        NAMED_STRUCT('id', product, 
                     'total_level_type', 'product',
                     'total_quantity_individual', sum(quantity) OVER (PARTITION BY family_id, shopper_id, product),
                     'total_quantity_family', sum(quantity) OVER (PARTITION BY family_id, product)
                     ),
        NAMED_STRUCT('id', category, 
                     'total_level_type', 'category',
                     'total_quantity_individual', sum(quantity) OVER (PARTITION BY family_id, shopper_id, category),
                     'total_quantity_family', sum(quantity) OVER (PARTITION BY family_id, category)
                     ),
        NAMED_STRUCT('id', department, 
                     'total_level_type', 'department', 
                     'total_quantity_individual', sum(quantity) OVER (PARTITION BY family_id, shopper_id, department),
                     'total_quantity_family', sum(quantity) OVER (PARTITION BY family_id, department)
                     )
      ) AS array_structs
  from
    transaction
)
lateral view inline(array_structs) exploded
order by
  total_level_type
alhazen
  • 1,907
  • 3
  • 22
  • 43
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • I adapted your query, but can't edit your post for some reason. – alhazen Dec 02 '20 at 21:42
  • The final table contains multiple transactions for the same family_id, shopper_id, product. Example ("A, 1, "Kit Kat") appears twice (though the totals are correct). So a group by is necessary I think. – alhazen Dec 02 '20 at 21:43
  • Ok updated your post. However, I couldn't get the GROUP BY to work. – alhazen Dec 02 '20 at 21:48