1

I'm banging my head for several hours now, trying to figure out how I could count the number of total newsletter subscribers per month using Redshift.

Base for the calculation is a events table that tracks every user action, specifically wether he subscribes or unsubscribes from the newsletter. Simplified it looks like this:

+----------------------+---------+---------------+
| timestamp            | user_id | action        |
+----------------------+---------+---------------+
| 2017-01-01T12:10:31Z |       1 | subscribed    |
| 2017-01-01T13:11:51Z |       2 | subscribed    |
| 2017-01-01T13:15:53Z |       3 | subscribed    |
| ...                  | ...     | ...           |
| 2017-02-17T09:42:33Z |       4 | subscribed    |
| ...                  | ...     | ...           |
| 2017-03-15T16:59:13Z |       1 | unsubscribed  |
| 2017-03-17T02:19:56Z |       2 | unsubscribed  |
| 2017-03-17T05:33:05Z |       2 | subscribed    |
| ...                  | ...     | ...           |

For every month I like to sum up the number of users that subscribed to the newsletter plus the number of users that were already subscribed and didn't unsubscribe. In the example above we would have 3 users in January, another one is added in February for a total of 4 subscribers. Then in March we lose one user while another one only unsubscribes temporarily. The number of total subscribers we have in March is 3.

The end result I'm looking for is something like this:

+------------+-------------+
| month      | subscribers |
+------------+-------------+
| 2017-01-01 |           3 |
| 2017-02-01 |           4 |
| 2017-03-01 |           3 |
| ...        |         ... |

Any ideas if and how this would be solvable using an SQL query (preferably working in Redshift or Postgres)?

André
  • 2,101
  • 20
  • 23
  • its important to know whether this is for redshift or postgres as they are quite different at advanced levels! which is it? please delete one tag? – Jon Scott Jan 08 '18 at 08:34

4 Answers4

1

You can use a recursive CTE to create each of the desired months. Then match subscriptions with unsubscriptions (another CTE for simplicity). Note the lateral join used to pick the top 1 matching unsubscribe. Finally, get the count of distinct user_id's for each month.

This is Postgres. Here is the SQL Fiddle where you can run this, adjust the data set, etc.

WITH RECURSIVE months(start_date, end_date) AS (
    select 
      timestamp '2017-01-01',/*change this date to adjust range*/
      (date_trunc('MONTH', timestamp '2017-01-01') + INTERVAL '1 MONTH')::DATE/*change this date to adjust range*/
  UNION ALL
    SELECT 
      start_date + interval '1 month',
      (date_trunc('MONTH', start_date + interval '1 month') + INTERVAL '1 MONTH')::DATE
    FROM 
      months 
    WHERE 
      start_date < timestamp '2017-12-01' /*change this date to adjust range*/
),

subscription_months(start_date, end_date, user_id) as(
  select
    months.start_date::DATE,
    months.end_date,
    initial_subscription.user_id
   from
     subscription initial_subscription
     left join lateral ( 
       select
         cancellation.timestamp
       from
         subscription cancellation
       where
         cancellation.user_id = initial_subscription.user_id
         and cancellation.timestamp >= initial_subscription.timestamp
         and cancellation.action = 'unsubscribed'
        order by
         cancellation.timestamp asc
        limit 1 
     ) as cancellation on true
     inner join months on
       initial_subscription.timestamp <= months.end_date
       and (
           cancellation.timestamp is null
           or cancellation.timestamp >= months.end_date
       )
    where
      initial_subscription.action = 'subscribed'
 )

 select 
   start_date,
   end_date,
   count(distinct user_id)
 from 
   subscription_months
 group by
   start_date,
   end_date
 order by 
   start_date
Robert Harris
  • 482
  • 2
  • 6
  • Nice! This gives the exact results as expected. I've tried this against my Postgres version of the data and it _looks_ pretty accurate. I haven't checked all the values yet though. Thanks a lot! Any thoughts how we could avoid a `LATERAL` join though? I don't see any way of replicating this in Redshift. – André Jan 06 '18 at 15:58
  • 1
    Here is another version without the lateral join. It uses another CTE to avoid it. http://sqlfiddle.com/#!17/fe3c7/59 What do you think? – Robert Harris Jan 06 '18 at 16:30
1

This seems to require bunch of joins that could take a long time to converge depending on your table size. If space was not an issue and these type of queries were frequent, I would add a third column with a (binary) flag that shows latest action which you can filter on. My attempt: SQL Fiddle

-- get starting month
WITH start_month AS(
  SELECT MIN(CAST(DATE_TRUNC('month', ts) AS DATE)) AS earliest
FROM test
),
-- bucket each date into months
month_buckets AS(
  SELECT CAST(DATE_TRUNC('month', ts) AS DATE) AS month_bucket
  FROM test
  GROUP BY 1
),
-- for each month bucket, find all actions taken by each user upto that month
master AS (SELECT mb.month_bucket, user_id, actions, ts
FROM month_buckets mb
LEFT JOIN test
ON CAST(DATE_TRUNC('month', test.ts) AS DATE) <= mb.month_bucket
)
-- for each user, get the latest action and timestamp
-- group by month_bucket, count
SELECT m1.month_bucket AS month, 
    COUNT(m1.user_id) AS subscribers
FROM master m1
JOIN (SELECT month_bucket, user_id, MAX(ts) AS ts
    FROM master
    GROUP BY 1, 2
    ) m2
ON m1.month_bucket = m2.month_bucket
AND m1.user_id = m2.user_id
AND m1.ts = m2.ts
AND m1.actions = 'subscribed'
GROUP BY 1
ORDER BY 1;
jimmu
  • 1,025
  • 1
  • 10
  • 15
  • This seems to work well. Thanks a lot for your help! I'm going to validate this against my actual data to see if I find any discrepancies. – André Jan 06 '18 at 16:07
1

The solution is:

1) create a calendar table that stores dates (a row in the table is a unique date), see more info in this question. This is very handy for most of BI queries.

2) write the query that consists of the following steps:

2a) based on subscribe/unsubscribe events, construct time ranges of subscribed status for every user (first identify the next event for every given event using lead function and pick up the necessary pairs). If the user has only one subscribe event, set date_to to current date using coalesce.

2b) join these ranges to the calendar table so a row is a date/user

2c) count the rows using one or another method (unique ID, average daily, first date of month, last date of month)

the query would look like this:

with
next_events as (
    select
     user_id
    ,"timestamp"::date as date_from
    ,action
    ,lead(timestamp) over (partition by user_id order by timestamp) ::date as date_to
    ,lead(action) over (partition by user_id order by timestamp) as next_action
    from your_table
    where action in ('subscribed','unsubscribed')
)
,ranges as (
    select 
     user_id
    ,date_from
    ,coalesce(date_to,current_date) as date_to
    from next_events
    where (action='subscribed' and next_action='unsubscribed')
    or (action='subscribed' and next_action is null)
)
,subscriber_days as (
    select 
     t1.user_id
    ,t2.date
    from ranges t1
    join calendar t2
    on t2.date between t1.date_from and t1.date_to
)
-- use whatever method needed to identify monthly N from daily N (first day, last day, average, etc.)
-- below is the unique count
select
 date_trunc('month',date) as date
,count(distinct user_id) as subscribers
from subscriber_days
group by 1
order by 1
André
  • 2,101
  • 20
  • 23
AlexYes
  • 4,088
  • 2
  • 15
  • 23
  • Thanks for the hints. I've played with this query a bit and generally it seems to work. But it's not accurate as shown in this SQLFiddle: http://sqlfiddle.com/#!17/b7ace/1 March seems to be off by one for example. I've added a custom calendar generating function based on some other sources for testing. – André Jan 06 '18 at 15:54
  • @André what's wrong with March? it's 7 and it seems so based on data if you count unique subscribers at any point of time during the month. User id=1 unsubscribed at the mid of the month but it's counted – AlexYes Jan 06 '18 at 17:26
  • Oh, I see now how you count this. Yes, that makes sense. You're absolutely right. Thanks! – André Jan 06 '18 at 17:42
  • Great! please accept the answer. Also you can calculate the daily counts first and then average them to get the average user count at any point of time during the month. The higher is the churn/growth, the bigger is the difference with unique count (if your user base is completely refreshed once a month the avg will be 2 times lower than the unique count) – AlexYes Jan 06 '18 at 18:15
  • I will accept this as an answer for two reasons: 1) it explains the necessary steps to take so a newcomer can understand what needs to be done, 2) it gives accurate numbers that I could validate manually against my expectations. Other answers had slight variations I couldn't figure out immediately. In the end I've even came up with my own approach by simply joining the subscriptions table with itself to figure out who subscribed but didn't unsubscribe yet for every given month. I'm happy to share the solution if anybody cares. – André Jan 06 '18 at 20:05
  • can you elaborate on the line `where (action='subscribed' and next_action='unsubscribed') or (action='subscribed' and next_action is null)`? From my understanding this isn't necessary and can be simplified to `where (action='subscribed')`. – André Jan 06 '18 at 20:12
  • @André that is just to make the description of possible scenarios more verbose ("subscribed then unsubscribed" or "subscribed once and still subscribed"), it's totally fine to reduce that – AlexYes Jan 06 '18 at 20:48
0

the total number of subscribed users is:

select count(*) 
from
(
  select distinct id
  from subscribers
  group by id
  having count(*) in  (1, 3, 5...) -- here you can use a table function to return odd numbers
) a

the number of subscribed in some period of time:

select count(distinct a.id) 
from
(
  select distinct id
  from subscribers
  group by id
  having count(*)  in  (1, 3, 5...) -- here you can use a table function to return odd numbers
) a join
 subscribers s on a.id = s.id
where timestamp between @date1 and @date2

note: i didn't try it in Redshift or Postgres

Dejan Dozet
  • 948
  • 10
  • 26