Cohort analysis using pgsql/activerecord

Question

I'm performing a cohort analysis on a single table messages. I need to calculate the retention rates of users that created a message (day_0), also created a message on the following day, day after, etc (day_1, day_2, etc).

I was previously doing most of the processing post-query in ruby iterations. Now I have larger tables to deal with. It's way too slow and memory intensive in ruby so I need to offload the heavy lifting to the DB. I've also tried the cohort_me gem and experienced poor performance.

I don't have much experience with SQL w/out activerecord. Here's what I have so far:

SELECT 
date_trunc('day', messages.created_at) as day,
count(distinct messages.user_id) as day_5_users
FROM 
messages
WHERE 
messages.created_at >= date_trunc('day', now() - interval '5 days') AND 
messages.created_at < date_trunc('day', now() - interval '4 days')
GROUP BY 1
ORDER BY 1;

This returns the count of users who created messages five days ago. Now I need to find the count of THOSE users who created messages the following day, day after that, etc. until the current day.

I need to perform this same analysis on different base days. So next instead of 5 days go, it starts the analysis at 4 days ago as the base day.

Can this be done with one query?

EDIT: messages.user_id is not actually a key to a different table. It's simply a unique identifier (string), so there's no other tables to be joined with this query.

Paul A Jungwirth · Accepted Answer · 2016-06-28T21:41:46.050

Heap Analytics has a nice blog post about lateral joins for doing something pretty similar. It might give you some ideas. Your situation is actually simpler than theirs though, so your solution is easier too.

First a few notes. You don't seem to need the day output, since it will always equal your inputs. Second, no matter what, you will need a separate output column for each day (or accumulate results in an array, which seems less desirable), so if you want a variable number of days, you'll have to dynamically build the SQL for that.

For testing I made a table and gave it a few rows:

create table messages (user_id integer, created_at timestamp);
insert into messages values (1, now() - interval '5 days'), (1, now() - interval '4 days'), (1, now() - interval '2 days');
insert into messages values (2, now() - interval '10 days'), (2, now() - interval '2 days');
insert into messages values (3, now() - interval '2 days'), (3, now() - interval '1 days');
insert into messages values (4, now() - interval '5 days');

I think you can get a very clean solution using lateral joins, kind of like the article above:

\set start_time '''2016-06-23 06:00:00'''

WITH t(s) AS (
  SELECT  :start_time::timestamp
)
SELECT  COUNT(DISTINCT m1.user_id) AS day_5_messages,
        COUNT(DISTINCT m2.user_id) AS day_4_messages,
        COUNT(DISTINCT m3.user_id) AS day_3_messages,
        COUNT(DISTINCT m4.user_id) AS day_2_messages,
        COUNT(DISTINCT m5.user_id) AS day_1_messages
FROM    messages m1
CROSS JOIN t
LEFT OUTER JOIN LATERAL (
    SELECT * FROM messages msub
    WHERE msub.user_id = m1.user_id
    AND msub.created_at <@
      tsrange(t.s + interval '1 day',
              t.s + interval '2 days')
    LIMIT 1
) m2
ON true
LEFT OUTER JOIN LATERAL (
    SELECT * FROM messages msub
    WHERE msub.user_id = m2.user_id
    AND msub.created_at <@
      tsrange(t.s + interval '2 days',
              t.s + interval '3 days')
    LIMIT 1
) m3
ON true
LEFT OUTER JOIN LATERAL (
    SELECT * FROM messages msub
    WHERE msub.user_id = m3.user_id
    AND msub.created_at <@
      tsrange(t.s + interval '3 days',
              t.s + interval '4 days')
    LIMIT 1
) m4
ON true
LEFT OUTER JOIN LATERAL (
    SELECT * FROM messages msub
    WHERE msub.user_id = m4.user_id
    AND msub.created_at <@
      tsrange(t.s + interval '4 days',
              t.s + interval '5 days')
    LIMIT 1
) m5
ON true
WHERE   m1.created_at <@
  tsrange(t.s,
          t.s + interval '1 day')
;

Here I am using the t(s) CTE just to avoid repeating :start_time again and again. It is optional if you don't like it. Also naturally in Rails you would use ? instead of :start_time to parameterize the query.

For testing it is helpful to replace each COUNT(...) with array_agg(...) so you can decide if the right user_ids are included or not.

I think this should perform well if you have an index on created_at and user_id (together). Or if your days will always start at the same instant (say midnight UTC), then you could use a functional index with just the date (not timestamp) and user_id, and then replace all the range conditions with simply being that day. That will perform even better.

Oh also: your query (and mine) always return just one row, which seems pretty suspicious. I wonder if that is really what you want, or if that's just an accident of simplifying things for your question. If you wanted one row per starting day, then you could put your day column back, group by it, remove my WHERE condition, and do all the joins based on the previous m table instead of t.s.

score 0 · Answer 2 · edited May 23 '17 at 12:08

0

Based on the lack of a foreign key, I would try and first put the messages into ranges. See this post: In SQL, how can you “group by” in ranges? using between times. Check if a time is between two times (time DataType) and then GROUP BY messages.user_id

edited May 23 '17 at 12:08

Community

1
1

answered Jun 28 '16 at 15:06

Alex Harris

6,172
2
32
57

I probably should of specified, but `user_id` is not actually a key for another table. It's just a unique string identifier. – mnort9 Jun 28 '16 at 15:08
Just curious, why not have a foreign key on that? – Alex Harris Jun 28 '16 at 15:18
The field is not actually `user_id` in my db, I just used that as an example for this post. Probably a bad example on my part b/c it looks like a foreign key – mnort9 Jun 28 '16 at 15:22
@shane I edited my post with another idea based on your feedback. – Alex Harris Jun 28 '16 at 16:53

Cohort analysis using pgsql/activerecord

2 Answers2