Count distinct customers, active within a year, for every week of the year

Question

I am working with an existing E-commerce database. Actually, this process is usually done in Excel, but we want to try it directly with a query in PostgreSQL (version 10.6).

We define as an active customer a person who has bought at least once within 1 year. This means, if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.

I want the output for each week of the year (2020). Basically what I need is ...

select
    email,
    orderdate,
    id
from
    orders_table
where
    paid = true;

|---------------------|-------------------|-----------------|
|      email          |     orderdate     |        id       |
|---------------------|-------------------|-----------------|
|  email1@email.com   |2020-06-02 05:04:32|     Order-2736  |
|---------------------|-------------------|-----------------|

I can't create new tables. And I would like to see the output like this:

Year| Week | Active customers
2020| 25   | 6978
2020| 24   | 3948

Please provide your version of Postgres (`SELECT version()` helps), a table definition (`CREATE TABLE` statement), and a few sample rows. — Erwin Brandstetter, Jun 17 '20 at 14:02
And "each week of the year"? This calendar year? Last? The year to date? — Erwin Brandstetter, Jun 17 '20 at 14:31

score 0 · Answer 1 · answered Jun 16 '20 at 21:08

depending on whether there is a year and week column you can use a OVER (PARTITION BY ...) with extract:

SELECT
    extract(year from orderdate),
    extract(week from orderdate),
    sum(1) as customer_count_in_week,
    OVER (PARTITION BY extract(YEAR FROM TIMESTAMP orderdate), 
                       extract(WEEK FROM TIMESTAMP orderdate))
FROM ordertable
WHERE paid=true;

Which should bucket all orders by year and week, thus showing the total count per week in a year where paid is true.

references:

Erwin Brandstetter · Answer 2 · 2020-06-18T00:34:35.240

if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.

Problems on your side

This method has some corner case ambiguities / issues:

Do you include or exclude "week 22 in 2020"? (I exclude it below to stay closer to "a year".)
A year can have 52 or 53 full weeks. Depending on the current date, the calculation is based on 52 or 53 weeks, causing a possible bias of almost 2 %!

If you start the time range on "the same date last year", then the margin of error is only 1 / 365 or ~ 0.3 %, due to leap years.

A fixed "period of 365 days" (or 366) would eliminate the bias altogether.

Problems on the SQL side

Unfortunately, window functions do not currently allow the DISTINCT key word (for good reasons). So something of the form:

SELECT count(DISTINCT email) OVER (ORDER BY year, week
                                   GROUPS BETWEEN 52 PRECEDING AND 1 PRECEDING)
FROM   ...

.. triggers:

ERROR:  DISTINCT is not implemented for window functions

The GROUPS keyword has only been added in Postgres 10 and would otherwise be just what we need.

What's more, your odd frame definition wouldn't even work exactly, since the number of weeks to consider is not always 52, as discussed above.

So we have to roll our own.

Solution

The following simply generates all weeks of interest, and computes the distinct count of customers for each. Simple, except that date math is never entirely simple. But, depending on details of your setup, there may be faster solutions. (I had several other ideas.)

The time range for which to report may change. Here is an auxiliary function to generate weeks of a given year:

CREATE OR REPLACE FUNCTION f_weeks_of_year(_year int)
  RETURNS TABLE(year int, week int, week_start timestamp)
  LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE
  ROWS 52 COST 10 AS
$func$
SELECT _year, d.week::int, d.week_start
FROM   generate_series(date_trunc('week', make_date(_year, 01, 04)::timestamp)  -- first day of first week
                     , LEAST(date_trunc('week', localtimestamp), make_date(_year, 12, 28)::timestamp) -- latest possible start of week
                     , interval '1 week')  WITH ORDINALITY d(week_start, week)
$func$;

Call:

SELECT * FROM f_weeks_of_year(2020);

It returns 1 row per week, but stops at the current week for the current year. (Empty set for future years.)

The calculation is based on these facts:

The first ISO week of the year always contains January 04.
The last ISO week cannot start after December 28.

Actual week numbers are computed on the fly using WITH ORDINALITY. See:

PostgreSQL unnest() with element number

Aside, I stick to timestamp and avoid timestamptz for this purpose. See:

Generating time series between two dates in PostgreSQL

The function also returns the timestamp of the start of the week (week_start), which we don't need for the problem at hand. But I left it in to make the function more useful in general.

Makes the main query simpler:

WITH weekly_customer AS (
   SELECT DISTINCT
          EXTRACT(YEAR FROM orderdate)::int AS year
        , EXTRACT(WEEK FROM orderdate)::int AS week
        , email
   FROM   orders_table
   WHERE  paid
   AND    orderdate >= date_trunc('week', timestamp '2019-01-04')  -- max range for 2020!
   ORDER  BY 1, 2, 3  -- optional, might improve performance
   )
SELECT d.year, d.week
     , (SELECT count(DISTINCT email)
        FROM   weekly_customer w
        WHERE  (w.year, w.week) >= (d.year - 1, d.week)  -- row values, see below
        AND    (w.year, w.week) <  (d.year    , d.week)  -- exclude current week
       ) AS active_customers
FROM   f_weeks_of_year(2020) d;  -- (year int, week int, week_start timestamp)

db<>fiddle here

The CTE weekly_customer folds to unique customers per calendar week once, as duplicate entries are just noise for our calculation. It's used many times in the main query. The cut-off condition is based on Jan 04 once more. Adjust to your actual reporting period.

The actual count is done with a lowly correlated subquery. Could be a LEFT JOIN LATERAL ... ON true instead. See:

What is the difference between LATERAL and a subquery in PostgreSQL?

Using row value comparison to make the range definition simple. See:

SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'

Count distinct customers, active within a year, for every week of the year

2 Answers2

Problems on your side

Problems on the SQL side

Solution