0

I have the following transactions table:

customer_id purchase_date   product         category        department      quantity    store_id
    1       2020-10-01      Kit Kat         Candy           Food                2       store_A
    1       2020-10-01      Snickers        Candy           Food                1       store_A
    1       2020-10-01      Snickers        Candy           Food                1       store_A
    2       2020-10-01      Snickers        Candy           Food                2       store_A
    2       2020-10-01      Baguette        Bread           Food                5       store_A
    2       2020-10-01      iPhone          Cell phones     Electronics         2       store_A
    3       2020-10-01      Sony PS5        Games           Electronics         1       store_A

I would like to calculate the average number of products purchased (for each product in the table). I'm also looking to calculate averages across each category and each department by accounting for all products within the same category or department respectively. Care should be taken to divide over unique customers AND the product quantity being greater than 0 (a 0 quantity indicates a refund, and should not be accounted for).

So basically, the output table would like below:

enter image description here

...where store_id and average_level_type are partition columns.

Is there a way to achieve this in a single pass over the transactions table? or do I need to break down my approach into multiple steps?

Thanks!

alhazen
  • 1,907
  • 3
  • 22
  • 43

1 Answers1

2

How about using “union all” as below -

Select store_id, 'product' as average_level_type,product as id, sum(quantity) as total_quantity,
Count(distinct customer_id) as unique_customer_count, sum(quantity)/count(distinct customer_id) as average
from transactions
where quantity > 0
group by store_id,product
Union all
Select store_id, 'category' as average_level_type, category as id, sum(quantity) as total_quantity,
Count(distinct customer_id) as unique_customer_count, sum(quantity)/count(distinct customer_id) as average
from transactions
where quantity > 0
group by store_id,category
Union all
Select store_id, 'department' as average_level_type,department as id, sum(quantity) as total_quantity,
Count(distinct customer_id) as unique_customer_count, sum(quantity)/count(distinct customer_id) as average
from transactions
where quantity > 0
group by store_id,department;

If you want to avoid using union all in that case you can use something like rollup() or group by grouping sets() to achieve the same but the query would be a little more complicated to get the output in the exact format which you have shown in the question.

EDIT : Below is how you can use grouping sets to get the same output -

Select store_id,
       case when G_ID = 3 then 'product' 
            when G_ID = 5 then 'category'
            when G_ID = 6 then 'department' end As average_level_type,
       case when G_ID = 3 then product 
            when G_ID = 5 then category
            when G_ID = 6 then department end As id,
       total_quantity,
       unique_customer_count,
       average
from            
    (select store_id, product, category, department, sum(quantity) as total_quantity, Count(distinct customer_id) as unique_customer_count, sum(quantity)/count(distinct customer_id) as average, GROUPING__ID As G_ID
    from transactions
    group by store_id,product,category,department
    grouping sets((store_id,product),(store_id,category),(store_id,department))
    ) Tab
order by 2    
;
Somy
  • 1,474
  • 1
  • 4
  • 13
  • @alhazen - The above approach work for you? Please share comments/thoughts. – Somy Oct 21 '20 at 15:45
  • I'd like to avoid the union approach. I'll check out grouping sets. Thanks. – alhazen Oct 21 '20 at 17:35
  • @alhazen - See my edit above. I have added how to write the query using grouping sets function to get the desired output. – Somy Oct 21 '20 at 19:21
  • Looks great! thank you. Do you know how to adapt the query so that the results are written to an output table whose partition columns are `store_id` and `average_level_type`? – alhazen Oct 21 '20 at 19:33
  • I made an error in my question. The average for each category is per department/category, and not global category. – alhazen Oct 21 '20 at 19:35
  • @alhazen - The table in which you want to insert the output/results of the query above, while creating that table (in the DDL statement) you have to define the partition columns using Hive syntax. I can update my answer to add that syntax for your reference. – Somy Oct 21 '20 at 19:37
  • Hmm, in that case, would recommend you to create a new question. Thanks!! – Somy Oct 21 '20 at 19:40
  • I'm familiar with the syntax to create a partitioned table. I'm wondering how will your query will work, writing to multiple partitions dynamically? – alhazen Oct 21 '20 at 19:41
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/223420/discussion-between-somy-and-alhazen). – Somy Oct 21 '20 at 19:48