Using Postgres 9.5. Test data:

create temp table rental (
    customer_id smallint
    ,rental_date timestamp without time zone
    ,customer_name text
);

insert into rental values
    (1, '2006-05-01', 'james'),
    (1, '2006-06-01', 'james'),
    (1, '2006-07-01', 'james'),
    (1, '2006-07-02', 'james'),
    (2, '2006-05-02', 'jacinta'),
    (2, '2006-05-03', 'jacinta'),
    (3, '2006-05-04', 'juliet'),
    (3, '2006-07-01', 'juliet'),
    (4, '2006-05-03', 'julia'),
    (4, '2006-06-01', 'julia'),
    (5, '2006-05-05', 'john'),
    (5, '2006-06-01', 'john'),
    (5, '2006-07-01', 'john'),
    (6, '2006-07-01', 'jacob'),
    (7, '2006-07-02', 'jasmine'),
    (7, '2006-07-04', 'jasmine');

I am trying to understand the behaviour of existing customers, specifically to answer this question:

What is the likelihood that a customer orders again, based on when their last order was (current month, previous month (m-1), ... back to m-12)?

Likelihood is calculated as:

distinct count of people in the cohort who ordered in the current month /
distinct count of people in the cohort.

For example, with July as the current month, the m-1 (June) cohort below contains john and julia; only john ordered again in July, so the likelihood for that cohort is 1/2.

Thus, I need to generate a table that shows, for each cohort, how many people are in it and how many of them ordered in the current month.

So what are the rules for being in a cohort? (A sketch applying them to the test data follows this list.)

- current month (m) cohort: >= 2 orders in m, OR exactly 1 order in m with no previous orders
- m-1 cohort: <= 1 order in m and >= 1 order in m-1
- m-2 cohort: <= 1 order in m, 0 orders in m-1, and >= 1 order in m-2
- etc.
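
A minimal sketch applying these rules to the test data for a single analysis month (July 2006). The hard-coded dates and the reading "cohort month = month of the 2nd-last order, or of the only order for a brand-new customer" are my interpretation of the rules above, so treat this as illustrative rather than definitive:

with orders_ranked as (
    select customer_id
         , customer_name
         , date_trunc('month', rental_date)::date as mon
         , row_number() over (partition by customer_id
                              order by rental_date desc) as rnk
    from   rental
    where  rental_date < date '2006-08-01'  -- only orders up to the analysis month
)
select customer_id
     , customer_name
     , case when bool_or(rnk = 1 and mon = date '2006-07-01')  -- last order in m
            then coalesce(max(mon) filter (where rnk = 2)      -- 2nd-last order month
                        , date '2006-07-01')                   -- brand-new customer
            else max(mon) filter (where rnk = 1)               -- last order before m
       end                                          as cohort_month
     , bool_or(rnk = 1 and mon = date '2006-07-01')
       and count(*) > 1                             as orders_again_in_m  -- jacob: in cohort, but not in numerator
from   orders_ranked
group  by customer_id, customer_name
order  by customer_id;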

I am using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/

Here is an example of the cohort rules and aggregations, with July as the "month's orders being analysed" (note: "month's orders being analysed" is the first column in the 'Desired output' table below; a doubled cell such as "1  1" marks two orders in that month):

customer    | jul-06| jun-06| may-06|
------------|-------|-------|-------|
james       | 1  1  | 1     | 1     | <- member of jul cohort, made order in jul
jasmine     | 1  1  |       |       | <- member of jul cohort, made order in jul
jacob       | 1     |       |       | <- member of jul cohort, does NOT count as ordering again (only one lifetime order)
john        | 1     | 1     | 1     | <- member of jun cohort, made order in jul
julia       |       | 1     | 1     | <- member of jun cohort, did NOT make order in jul
juliet      | 1     |       | 1     | <- member of may cohort, made order in jul
jacinta     |       |       | 1 1   | <- member of may cohort, did NOT make order in jul

This data would output the following table:

--where m = month's orders being analysed

month's orders |how many people |how many people from  |how many people   |how many people from    |how many people   |how many people from    |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-06         |5               |1                     |                  |                        |                  |                        |
jun-06         |                |                      |5                 |3                       |                  |                        |
jul-06         |3               |2                     |2                 |1                       |2                 |1                       |

My attempts so far have been variations on:

generate_series()

and

row_number() over (partition by customer_id order by rental_date desc)

I haven't been able to get everything to come together yet (I've tried for many hours and haven't yet solved it).

For readability, I think posting my work in parts is better (if anyone wants me to post the SQL query in its entirety, please comment and I'll add it).

series query:

(select
    generate_series(date_trunc('month', min(rental_date)),
                    date_trunc('month', max(rental_date)),
                    interval '1 month') as month_being_analysed
from
    rental) as series
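
Run on its own against the test data above, this series should return three month starts:

 month_being_analysed
----------------------
 2006-05-01 00:00:00
 2006-06-01 00:00:00
 2006-07-01 00:00:00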

rank query:

(select
    *,
    row_number() over (partition by customer_id order by rental_date desc) as rnk
from
    rental
where
    date_trunc('month', rental_date) <= series.month_being_analysed) as orders_ranked

I want to do something like: run the orders_ranked query for every row returned by the series query, and then base aggregations on each return of orders_ranked.

Something like:

(--this query counts the customers in cohort m-1
select
    count(distinct customer_id)
from
    (--this query ranks the orders that have occurred on or before the date in the row of the 'series' table
    select
        *,
        row_number() over (partition by customer_id order by rental_date desc) as rnk
    from
        rental
    where
        date_trunc('month', rental_date) <= series.month_being_analysed) as orders_ranked
where
    (rnk = 1 or rnk = 2)
    and date_trunc('month', rental_date) = series.month_being_analysed - interval '1 month'
) as people_2nd_last_booking_in_m_1,


(--this query counts the customers in cohort m-1 who ordered in month m
select
    count(distinct customer_id)
from
    (--this query pairs each customer's last order month with their 2nd-last order month
    select
        customer_id,
        max(date_trunc('month', rental_date)) filter (where rnk = 1) as last_mon,
        max(date_trunc('month', rental_date)) filter (where rnk = 2) as second_last_mon
    from
        (--this query ranks the orders that have occurred on or before the date in the row of the 'series' table
        select
            *,
            row_number() over (partition by customer_id order by rental_date desc) as rnk
        from
            rental
        where
            date_trunc('month', rental_date) <= series.month_being_analysed) as orders_ranked
    group by
        customer_id) as last_two_orders
where
    last_mon = series.month_being_analysed
    and second_last_mon = series.month_being_analysed - interval '1 month'
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
    (select
        generate_series(date_trunc('month', min(rental_date)),
                        date_trunc('month', max(rental_date)),
                        interval '1 month') as month_being_analysed
    from
        rental) as series
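
As an aside on the "run orders_ranked once per row of series" step: that is exactly what a LATERAL join (Postgres 9.3+) expresses. A minimal sketch of the shape, not the full cohort query:

select series.month_being_analysed, orders_ranked.*
from (
    select generate_series(date_trunc('month', min(rental_date)),
                           date_trunc('month', max(rental_date)),
                           interval '1 month') as month_being_analysed
    from   rental
) as series
cross join lateral (
    select customer_id
         , date_trunc('month', rental_date) as mon
         , row_number() over (partition by customer_id
                              order by rental_date desc) as rnk
    from   rental
    where  date_trunc('month', rental_date) <= series.month_being_analysed
) as orders_ranked;

The per-cohort counts could then be conditional aggregates over this result (count(distinct customer_id) filter (where ...), Postgres 9.4+) instead of one correlated subquery per output column.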
Sean
  • Please provide the exact table definition with data types and all constraints for the relevant columns `rental_id`, `customer_id`, `rental_date`. Ideally as a valid `CREATE TABLE` statement. (I am not going to extract it from the linked archive myself.) And, *always*, your version of Postgres. – Erwin Brandstetter Jul 16 '16 at 17:42
  • Also: `current month cohort: >1 order in month`. I suspect that's supposed to be **`>=`** in all instances of your definition? Please clarify. And: cardinalities? How many orders per customer and month: min, max, avg? – Erwin Brandstetter Jul 16 '16 at 17:47
  • @ErwinBrandstetter I have added test data (.tar file for the dvdrental db, and the relevant commands) - I tried to just do a dump of the table, but ran into problems - hopefully what I've added suffices. You were correct, the cohort should have been `>=`, I have added a table that shows the cohort rules in action - hopefully this clarifies things. I've added the Postgres version to the top - 9.5. As far as data volume goes - single digit millions of rows. Customers per month: hundreds of thousands, orders per month avg <2. – Sean Jul 17 '16 at 06:33
  • also, are you on codementor? :-) – Sean Jul 17 '16 at 06:37
  • changed the test data from the .tar file to an insert command. – Sean Jul 17 '16 at 11:37
  • I appreciate you made an effort to clarify your case. But I still stopped short at a contradiction. You write: `jacob ... member of jul cohort, did NOT make order in jul`, but he *did*. I guess you mean *no order in jun* instead of *jul*, but I guessed the first time and wasted quite some time trying to help. You need to present a consistent question for such a sophisticated case. As for codementor .. interesting, didn't know the site. I do paid consulting work without arbiter, though. – Erwin Brandstetter Jul 24 '16 at 02:08
  • @ErwinBrandstetter thanks for your reply, and thanks for all of the effort that you put into your answer. For clarity `jacob ... member of jul cohort, did NOT make order in jul` - is correct. He counts for the propensity denominator, but not for the numerator (to be part of the numerator he would need to have made >= 2 orders in his life as a customer). As part of the paid consulting work that you do, is there any chance that you would be open to doing some tutoring? – Sean Jul 24 '16 at 07:33
  • yeah, you're correct - `did NOT make...` is misleading - it should read `does not count...` or something similar - I'll ponder this :-/ I'll reach out to you at that email :-) – Sean Jul 26 '16 at 08:07

1 Answer


This query does everything. It operates on the whole table and works for any time range.

Based on some assumptions, and assuming the current Postgres version 9.5; it should work with pg 9.1 at least. Since your definition of "cohort" is unclear to me, I skipped the "how many people are in cohort" columns.

I would expect it to be faster than anything you tried so far. By orders of magnitude.
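
Note that crosstab() is supplied by the additional module tablefunc, which has to be installed once per database before this will run:

CREATE EXTENSION IF NOT EXISTS tablefunc;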

SELECT *
FROM   crosstab (
   $$
   SELECT mon
        , sum(count(*)) OVER (PARTITION BY mon)::int AS m0
        , gap   -- count of months since last order
        , count(*) AS gap_ct
   FROM  (
      SELECT mon
           , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
      FROM  (
         SELECT DISTINCT ON (1,2)
                date_trunc('month', rental_date)::date AS mon
              , customer_id                            AS c_id
              , extract(YEAR  FROM rental_date)::int * 12
              + extract(MONTH FROM rental_date)::int   AS mon_int
         FROM   rental
         ) dist_customer
      ) gap_to_last_month
   GROUP  BY mon, gap
   ORDER  BY mon, gap
   $$
 , 'SELECT generate_series(1,12)'
   ) ct (mon date, m0 int
       , m01 int, m02 int, m03 int, m04 int, m05 int, m06 int
       , m07 int, m08 int, m09 int, m10 int, m11 int, m12 int);

Result:

    mon     | m0 | m01 | m02 | m03 | m04 | m05 | m06 | m07 | m08 | m09 | m10 | m11 | m12
------------+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
 2015-01-01 | 63 |  36 |  15 |   5 |   3 |   3 |     |     |     |     |     |     |
 2015-02-01 | 56 |  35 |   9 |   9 |   2 |     |   1 |     |     |     |     |     |
...

m0 .. customers with >= 1 order this month
m01 .. customers with >= 1 order this month and >= 1 order 1 month before (nothing in between)
m02 .. customers with >= 1 order this month and >= 1 order 2 months before and no order in between
etc.
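
When adapting this, keep the category query and the column definition list in sync: 'SELECT generate_series(1,12)' produces exactly the categories that fill m01 .. m12. To look back a different number of months, change both together; for example, an (untested) 3-month variant would end with:

 , 'SELECT generate_series(1,3)'
   ) ct (mon date, m0 int, m01 int, m02 int, m03 int);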

How?

  1. In subquery dist_customer, reduce to one row per month and customer (mon, c_id) with DISTINCT ON. To simplify later calculations, also derive a running month number for each date (mon_int). If there are many orders per (month, customer), there are faster query techniques for this first step.

  2. In subquery gap_to_last_month, add the column gap: the number of months between this month and the same customer's previous month with any orders, computed with the window function lag().

  3. In the outer SELECT, aggregate per (mon, gap) to get the counts you are after. In addition, get the total count of distinct customers for the month (m0).

  4. Feed this query to crosstab() to pivot the result into the desired tabular form. m0 rides along as an "extra" column, placed between the row name (mon) and the category (gap); the intermediate result below shows exactly what crosstab() receives.
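
To see that intermediate result, run the source query (everything between the dollar quotes) on its own. Against the small test data from the question, it should return:

    mon     | m0 | gap | gap_ct
------------+----+-----+--------
 2006-05-01 |  5 |     |      5
 2006-06-01 |  3 |   1 |      3
 2006-07-01 |  5 |   1 |      2
 2006-07-01 |  5 |   2 |      1
 2006-07-01 |  5 |     |      2

Rows with gap IS NULL (a customer's first-ever order month) match no category in generate_series(1,12), so they contribute only to m0. crosstab() then pivots this into m01 = 3 for June (james, julia and john ordered in June after ordering in May) and m01 = 2, m02 = 1 for July.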

Erwin Brandstetter
  • thanks for posting this! I've spent the last couple of hours going through those links, and will continue going through them over the next couple of days to make sure that I understand each of them. I've updated my question which should help makes things more clear. As it stands, what is posted doesn't answer the question. If you have some time, it would be really great if you could please revisit my question with the updates that I've made. Regardless, I'm hoping that once I understand your query, that I'll be able to use it as a base. Thanks again! – Sean Jul 17 '16 at 11:57