1

I have transaction data table which is in user level, where user buy different products day to day. I want to aggregate the total transaction count for per user by pivoting the columns with same prefix. meaning, per user, total time of transaction for pivoted columns. it should be conditioned on date as well.

reproducible data and attempt

here is reproducible data schema:

CREATE OR REPLACE TABLE transaction_tble (
  user_id char(10),
  transaction_date, date,
  catg_prod1_amt int(10),
  catg_prod2_amt int(10),
  catg_prod3_amt int(10),
  sku_item1_qty int(10),
  sku_item2_qty int(10),
  sku_item3_qty int(10)
)

here is the value of the table:

here is the sample data:

user_id,transaction_date,sku_sale_prod1,sku_sale_prod2,sku_sale_prod3,catg_purch_prod3,catg_purch_prod4,catg_purch_prod5
A,1/14/2022,10,0,0,0,11,0
B,1/10/2022,0,18,0,5,11,7
A,1/19/2022,6,18,2,5,0,0
A,1/19/2022,10,18,1,5,11,7
B,1/19/2022,10,18,1,5,11,7
C,1/20/2022,10,18,1,5,11,7
C,1/20/2022,10,18,1,5,11,7
B,4/19/2022,10,18,1,5,11,7
A,5/23/2022,10,18,1,5,11,7
A,7/23/2022,10,18,1,5,11,7

my attempt

I think using pivot operation may be right approach but it doesn't work

CREATE OR REPLACE TABLE trans_view (
    select 
        user_id,
        ( case when transaction_date between '2022-01-01' - INTERVAL 6 MONTH then count(distinct transaction_date) end) as 2022_h1_trans_cnt, 
        ( case when transaction_date between '2022-06-01' - INTERVAL 6 MONTH then count(distinct transaction_date) end) as 2022_h2_trans_cnt,

        -- using pivot here
        ( case 
            when transaction_date between '2022-01-01' - INTERVAL 6 MONTH
            and 
            PIVOT (
                count(distinct transaction_date)
                for sku in (
                    'sku_sale_prod1','sku_sale_prod2','sku_sale_prod3'
                )
            )
            then count(distinct transaction_date) 
            end
        ) as 2022_h1_trans_cnt_sku, 
        -- 
        ( case 
            when transaction_date between '2022-01-01' - INTERVAL 6 MONTH
            and 
            PIVOT (
                count(distinct transaction_date)
                for catg in (
                    'catg_purch_prod3','catg_purch_prod4','catg_purch_prod5'
                )
            )
            then count(distinct transaction_date) 
            end
        ) as 2022_h1_trans_cnt_catg
    from transaction_table
    group by user_id 
)

I looked into pivot function in big query, not sure what went wrong with my logic. Can anyone suggest possible workaround to do this?

desired output

here is my desired output table schema:

CREATE OR REPLACE VIEW transc_view as(
select
    user_id,
    2022_h1_trans_cnt,
    2022_h2_trans_cnt,
    2022_h1_trans_cnt_catg,
    2022_h2_trans_cnt_catg,
    2022_h1_trans_cnt_sku,
    2022_h2_trans_cnt_sku
from transaction_tble
group by user_id
)

here is desired value of the output table:

user_id,2022_1st_6mth_trns_cnt,2022_2nd_6mth_trns_cnt,2022_1st_6mth_trns_cnt_sku,2022_2nd_6mth_trns_cnt_sku,2022_1st_6mth_trns_cnt_catg,2022_2nd_6mth_trns_cnt_catg
A,4,1,3,,3,1
B,3,0,3,0,3,0
C,1,1,1,0,1,0

How can I correctly pivot the table and get distinct transaction date count for this? any thoughts or quick workaround on this?

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
Hamilton
  • 620
  • 2
  • 14
  • 32
  • Explain why it doesn't work? Does it return error or wrong results or no result at all? – FanoFN Mar 16 '23 at 05:26
  • @FanoFN no results at all, I think I am on the right path, I don't not quite use pivot in sql so don't know what happen. Could you help me with that? – Hamilton Mar 16 '23 at 05:28
  • I'm not familiar with bigquery but there's no `PIVOT` function in MySQL.. if you are using MySQL – FanoFN Mar 16 '23 at 05:30
  • @FanoFN I think logic should be same, I am using bigquery, but looking at pivot usage in sql in general. Any input would be appreciated though – Hamilton Mar 16 '23 at 05:31
  • 1
    Well then perhaps you can start here `( case when transaction_date between '2022-01-01' - INTERVAL 6 MONTH then count(distinct transaction_date) end)` .. try replacing that to something like `SUM( case when transaction_date between '2022-01-01' - INTERVAL 6 MONTH then 1 ELSE 0 end)`. – FanoFN Mar 16 '23 at 05:34
  • That `BETWEEN` seems wrong as well.. shouldn't it be `case when transaction_date between '2022-01-01' - INTERVAL 6 MONTH AND '2022-01-01'`? – FanoFN Mar 16 '23 at 05:44
  • @FanoFN wouldn't that be aggregation of aggregation though and that's not allowed in bq or sql. also I need to pivot the columns. any better ideas to get my desired output? – Hamilton Mar 16 '23 at 05:44
  • It's not an aggregation of aggregation it's `.. THEN 1 ELSE 0 END` as a replacement for the count and not `.. THEN column_with_value ELSE column_with_other_value END`. Not sure if I explain that correctly but you can refer to this example fiddle https://dbfiddle.uk/5RjF7rpi – FanoFN Mar 16 '23 at 06:02
  • I removed the MySQL tag. There are other methods to do pivot tables in MySQL, but they are not relevant to this question if the OP is not using MySQL. I wish folks would not spam tags. – Bill Karwin Mar 16 '23 at 06:09
  • thanks, I figured out solution [here](https://dbfiddle.uk/qHu2wxry) – Hamilton Mar 20 '23 at 22:18

1 Answers1

2

Except 2022_h1_trans_cnt and 2022_h2_trans_cnt in your expected outout, you can consider below.

WITH stacked AS (
  SELECT *, 
        'y' || EXTRACT(YEAR FROM tx_date) || '_' || 
        CASE 
          WHEN EXTRACT(QUARTER FROM tx_date) IN (1, 2) THEN 'h1' -- 1st, 2nd Quarter 
          WHEN EXTRACT(QUARTER FROM tx_date) IN (3, 4) THEN 'h2' -- 3rd, 4th Quarter
        END AS period_name,
        CASE 
          WHEN STARTS_WITH(metric, 'sku_sale_prod') THEN 'trns_cnt_sku' 
          WHEN STARTS_WITH(metric, 'catg_purch_prod') THEN 'trns_cnt_catg'
        END AS metric_name,
    FROM (
      SELECT *, PARSE_DATE('%m/%d/%Y', transaction_date) tx_date
        FROM transaction_table 
     UNPIVOT (value FOR metric IN (
                      catg_purch_prod3, catg_purch_prod4, catg_purch_prod5,
                      sku_sale_prod1, sku_sale_prod2, sku_sale_prod3
                    )
             )
    )
),
aggregation AS (
  SELECT user_id, period_name || '_' || metric_name AS tx_period_metric,
         COUNT(DISTINCT tx_date) cnt
    FROM stacked
   GROUP BY 1, 2
)
SELECT user_id,
       y2022_h1_trns_cnt_sku AS y2022_h1_trns_cnt,
       y2022_h2_trns_cnt_sku AS y2022_h2_trans_cnt,
       * EXCEPT(user_id)
  FROM aggregation
 PIVOT (SUM(cnt) FOR tx_period_metric IN (
                   'y2022_h1_trns_cnt_sku', 'y2022_h2_trns_cnt_sku',
                   'y2022_h1_trns_cnt_catg', 'y2022_h2_trns_cnt_catg'
                 )
       )
;

Query results

enter image description here

  • It looks 2022_h1_trans_cnt and 2022_h2_trans_cnt are same value as that of y2022_h1_trns_cnt_sku, y2022_h2_trns_cnt_sku respectively. Correct me if I'm wrong.
Jaytiger
  • 11,626
  • 2
  • 5
  • 15
  • thanks for your help, this is great. on the top of your solution, if I want `y2022_h1_trns_cnt`, `y2022_h2_trns_cnt` also like I showed in desired output, how would I include that in output table? – Hamilton Mar 16 '23 at 11:30
  • @Hamilton, I've updated my answer. I might have misunderstood your intentions, so be sure to validate the query with your data. – Jaytiger Mar 16 '23 at 11:42
  • And if your aggregation logic is just to be `count(distinct transaction_date)`, the query doesn't need to be complex like this. but I think you're applying another aggregation logic to your real dataset. – Jaytiger Mar 16 '23 at 11:44
  • `2022_h1_trans_cnt` and `2022_h2_trans_cnt` are not as same as `y2022_h1_trns_cnt_sku`, `y2022_h2_trns_cnt_sku`, in sample data they are looks like it, how can we come around this? also in actual data transaction date could be various like `2020-mm-dd`, `2021-mm-dd`, `2022-mm-dd`, how to make `tx_period_metric IN ()` in last `PIVOT` more dynamic instead of hard coding? Could you share possible thoughts on those? – Hamilton Mar 16 '23 at 12:07
  • For your second question, you should consider to use [a dynamic sql](https://cloud.google.com/bigquery/docs/reference/standard-sql/procedural-language#execute_immediate) to generate your query dynamically. – Jaytiger Mar 16 '23 at 12:16
  • sorry but still `2022_h1_trans_cnt` and `2022_h2_trans_cnt` logic is not clear to me.. from your query, it's *case when transaction_date is 1st half then count(distinct transaction_date) end) as 2022_h1_trans_cnt* which is same as other metric like *y2022_h1_trns_cnt_sku* since both metric considers only transaction date in a same row. – Jaytiger Mar 16 '23 at 12:18
  • no worries, I got the way to do it. I got data type error in UNPIVOT in clause, `The datatype of column does not match with other datatypes in the IN clause. Expected FLOAT64, Found INT64`, any way to fix this? Thanks again for the help – Hamilton Mar 16 '23 at 15:01
  • 1
    @Hamilton, All the columns you're trying to unpivot should have same type. So check the type of each column and make them same by casting. For example, `CAST(int_column AS FLOAT64)` – Jaytiger Mar 16 '23 at 15:07
  • all the columns have to be same type. in actual data, some columns are in FLOAT, some are in INT; so UNPIVOT expect columns with same data type? – Hamilton Mar 16 '23 at 15:14
  • I did [some queries](https://dbfiddle.uk/bEJ0L5sL) to produce expected output in better way, could you take a look and share your thoughts if possible? thank you – Hamilton Mar 19 '23 at 05:37
  • I think Barmar explained well in his answer why you need to unpivot your input table first to avoid multiple **sum** aggregations. And you insist the expected schema, you should pivot the result of aggregations as I've already shown you. – Jaytiger Mar 19 '23 at 07:18
  • I agree with you. I don't need old expected schema, just do sum of each colums values in 6 months then group by user_id, purch_yr and period, do you mind to show pivot solution in another new post so I can close this question? – Hamilton Mar 19 '23 at 09:53
  • 1
    You can find many post related to *pivot* and also *dynamic sql*. kindly see below links. https://stackoverflow.com/questions/73711209/pivoting- bigquery-tables-on-date-type https://stackoverflow.com/questions/72288319/can-pivot-in-bigquery-generate-a-total-column-aggregated-across-all-dimension https://stackoverflow.com/questions/75688365/bigquery-unnest-and-pivot-column https://stackoverflow.com/questions/74327529/bigquery-best-way-to-transpose-rows-into-multiple-columns – Jaytiger Mar 19 '23 at 11:11