0

Assume I have the following table: http://sqlfiddle.com/#!17/3de963/1

What I am trying to calculate is for each paidmonth in the table, I want to calculate total users and total purchase_amt for that month and 11 months prior (so a total of 12 months, including the month of the current row).

I can calculate the total amount easily and I have done this by doing:

sum(purchase_amt) over (order by paidmonth asc rows 11 preceding)

However, when I try to do:

count(distinct user_id) over (order by paidmonth asc rows 11 preceding)

I get this error:

Window ORDER BY is not allowed if DISTINCT is specified

So this is the result I am hoping to get:

| paidmonth  | total_unique_users | total_amount |
| ---------- | ------------------ | ------------ |
| 2020-10-01 | 1                  | 20           |
| 2020-11-01 | 1                  | 50           |
| 2020-12-01 | 1                  | 100          |
| 2021-06-01 | 2                  | 180          |
| 2022-03-01 | 2                  | 85           |
| 2022-06-01 | 1                  | 105          |
| 2022-10-01 | 2                  | 175          |

If there are any additional columns you require, please let me know and I will help. The table I have shown in the link is a summary CTE.

jqlifer1
  • 57
  • 5
  • I think there are some inconsistencies between the sample data in the sqlfiddle and your expected output here. In any case, try this using `group by`: `select paidmonth, count(distinct user_id) unique_users, sum(purchase_amt) total_purchase_amt from test group by paidmonth order by paidmonth desc limit 12;` – Ionuț G. Stan Oct 14 '22 at 06:27
  • @a_horse_with_no_name apologies I corrected it. – jqlifer1 Oct 15 '22 at 03:17

2 Answers2

1

The trick is to turn your list of users into array and then do the calculation unnesting it. Hope this helps.

with temp_table as (
  select '2020-10-01' paidmonth, 23 user_id, 392 order_id, 20 purchase_amt union all
  select '2020-11-01', 23, 406, 30 union all
  select '2020-12-01', 23, 412, 50 union all
  select '2021-06-01', 32, 467, 80 union all
  select '2022-03-01', 87, 449, 5 union all
  select '2022-06-01', 87, 512, 100 union all
  select '2022-10-01', 87, 553, 50 union all
  select '2022-10-01', 155, 583, 20
),
  calcs AS (
  SELECT
    paidmonth,
    purchase_amt,
    ARRAY_AGG(user_id) OVER(ORDER BY paidmonth ASC ROWS 11 PRECEDING ) AS last_11_unique_users
  FROM
    temp_table )
SELECT
  paidmonth,
  (SELECT COUNT(DISTINCT users) FROM UNNEST(last_11_unique_users) AS users) total_unique_users,
  SUM(purchase_amt) OVER (ORDER BY paidmonth ASC ROWS 11 PRECEDING) total_amount
FROM
  calcs
AlienDeg
  • 1,288
  • 1
  • 13
  • 23
  • Thank you for responding. I think you are very close with regards to your concept. I tried applying the above concept to my original dataset which has about 2 million rows and I end up getting 12 total_unique_users for every month. Exactly 12 for every month. I think what is happening is that in your calcs CTE, the array_agg is aggregating 12 users in batches even though there are > 12 users in a month. – jqlifer1 Oct 15 '22 at 03:26
  • I tried doing: ```array_agg(user_id) over (partition by month_paid``` And then doing the ```select count(distinct users)...``` but it is taking alot of resources and will not run. – jqlifer1 Oct 15 '22 at 04:21
0

Thank you to AlienDeg, he gave me insights into how I should approach this. I think I finally got my answer by doing some trial and error approach and then almost following the solution here: Count unique ids in a rolling time frame

Now what I had to end up changing was that I could not use unix_date(paidmonth) range between 11 preceding and current row as mentioned in the solution in the link above. This is because, if I did that I would just get count distinct of users per month as I am using range between.

So I ended up doing:

cte1 as (
select
  paidmonth,
  string_agg(distinct user_id) users
from tempt1 -- this is just a table of distinct month and user_ids, where user_ids have been cast as string
group by paidmonth
),

cte2 as (
select
  paidmonth,
  string_agg(users) over (order by month_paid rows between 11 preceding and current row) users_12m
from cte1
)

select 
  paidmonth,
  (select count(distinct id) from unnest(split(users_12m)) as id) unique_12m
from cte2
order by 1 asc;

So instead of using unix_date and range, I just used order by month and rows.

jqlifer1
  • 57
  • 5