
Given a table like:

 person_id | contact_day | days_last_contact | dash_group
-----------+-------------+-------------------+------------
         1 | 2015-02-09  |                   |          1
         1 | 2015-05-01  |                81 |          2
         1 | 2015-05-02  |                 1 |          2
         1 | 2015-05-03  |                 1 |          2
         1 | 2015-06-01  |                29 |          3
         1 | 2015-08-01  |                61 |          4
         1 | 2015-08-04  |                 3 |          4
         1 | 2015-09-01  |                28 |          5
         2 | 2015-05-01  |                   |          1
         2 | 2015-06-01  |                31 |          2
         2 | 2015-07-01  |                30 |          3
         3 | 2015-05-01  |                   |          1
         3 | 2015-05-02  |                 1 |          1
         3 | 2015-05-04  |                 2 |          1
         3 | 2015-06-01  |                28 |          2
         3 | 2015-06-02  |                 1 |          2
         3 | 2015-06-06  |                 4 |          3

See also DB Fiddle example.

How can I identify streaks of days - consecutive but allowing for a maximum gap?

Original columns in the data are person_id and contact_day. I would like to partition by person_id and the "streak" (group of days close by). My approach so far is to first calculate the number of days since last contact (days_last_contact) and then try to use that to calculate the column dash_group, which tags rows within the maximum threshold - 3 days in the example.

How can I calculate dash_group? I calculate days_last_contact by subtracting the 1-row lag of contact_day from contact_day, partitioned by person_id and ordered by date:

SELECT 
  contact_day - lag(contact_day, 1, NULL) 
    OVER (PARTITION BY person_id ORDER BY contact_day ASC) 
    AS days_last_contact
FROM mydata
;

But how can I then use that to group together the rows where days_last_contact is below some threshold (3 days in this example)? For person_id 1, dash_group 2 identifies May 1, 2 and 3 as close together; the next date for that person, June 1, is too far away (29 days since last contact, greater than the threshold of 3), so it gets a new dash_group. Similarly, dash_group 4 groups together August 1 and August 4, because the difference is exactly 3; but for June 2 and June 6 (person 3) the difference is 4, so they are classified in different groups.
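To pin down the grouping rule I'm after, here is a plain-Python reference implementation (just to illustrate the intended semantics, not the SQL I'm looking for):

```python
from datetime import date

MAX_GAP = 3  # maximum allowed gap between contacts, in days

def dash_groups(days):
    """Number each contact day, starting a new group whenever the gap
    to the previous contact exceeds MAX_GAP days."""
    out, group, prev = [], 0, None
    for d in sorted(days):
        if prev is None or (d - prev).days > MAX_GAP:
            group += 1  # gap too large (or first row): open a new group
        out.append((d, group))
        prev = d
    return out

person_1 = [date(2015, 2, 9), date(2015, 5, 1), date(2015, 5, 2),
            date(2015, 5, 3), date(2015, 6, 1), date(2015, 8, 1),
            date(2015, 8, 4), date(2015, 9, 1)]
print(dash_groups(person_1))  # groups: 1, 2, 2, 2, 3, 4, 4, 5
```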

After looking around, I've found, for example, this SO question pointing to the 'trick' #4 here, which is nicely hacky but only works for consecutive dates / gapless series, and I need to allow for arbitrary gaps.


3 Answers


Counting the gaps (greater than the given tolerance) in a second window function forms the group numbers you are after:

SELECT person_id, contact_day
     , count(*) FILTER (WHERE gap > 3) OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
FROM  (
   SELECT person_id, contact_day
        , contact_day - lag(contact_day) OVER (PARTITION BY person_id ORDER BY contact_day) AS gap
   FROM   mydata
   ) sub
ORDER  BY person_id, contact_day;  -- optional

db<>fiddle here

About the aggregate FILTER clause: it's short and intuitive, and typically fastest.

This is the classic topic of "gaps and islands". Once you know to look for the [gaps-and-islands] tag, you'll find plenty of related or nigh-identical questions and answers.

I tagged accordingly now.
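As a runnable sanity check of this approach outside Postgres, here is the same query run against SQLite via Python's sqlite3 (an assumption: SQLite 3.30+, which supports FILTER on window aggregates; julianday() stands in for Postgres date subtraction). Note that with this exact query the first streak per person is numbered 0, since the NULL gap of the first row never passes the filter; the group boundaries are what matters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mydata (person_id INTEGER, contact_day TEXT);
INSERT INTO mydata VALUES
  (1,'2015-02-09'),(1,'2015-05-01'),(1,'2015-05-02'),(1,'2015-05-03'),
  (1,'2015-06-01'),(1,'2015-08-01'),(1,'2015-08-04'),(1,'2015-09-01'),
  (3,'2015-05-01'),(3,'2015-05-02'),(3,'2015-05-04'),
  (3,'2015-06-01'),(3,'2015-06-02'),(3,'2015-06-06');
""")

# Same two-step query: compute the gap to the previous contact per person,
# then count the gaps above the tolerance as a running group number.
rows = conn.execute("""
SELECT person_id, contact_day,
       count(*) FILTER (WHERE gap > 3)
         OVER (PARTITION BY person_id ORDER BY contact_day) AS dash_group
FROM  (
   SELECT person_id, contact_day,
          julianday(contact_day) - julianday(
            lag(contact_day) OVER (PARTITION BY person_id ORDER BY contact_day)
          ) AS gap
   FROM   mydata
   ) sub
ORDER  BY person_id, contact_day
""").fetchall()
grp = {(p, d): g for p, d, g in rows}
```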


Using a recursive query:


WITH RECURSIVE zzz AS (
    SELECT person_id
    , contact_day
    , md.days_last_contact
    , row_number() OVER(PARTITION BY person_id ORDER BY contact_day)
        AS dash_group
    FROM mydata md
    WHERE NOT EXISTS ( -- only the group *leaders*
            SELECT * FROM mydata nx
            WHERE nx.person_id = md.person_id
            AND nx.contact_day < md.contact_day
            AND nx.contact_day >= md.contact_day -3
            )
UNION ALL
    SELECT md.person_id
    , md.contact_day
    , md.days_last_contact
    , zzz.dash_group
    FROM zzz
    JOIN mydata md ON md.person_id = zzz.person_id
            AND md.contact_day > zzz.contact_day
            AND md.contact_day <= zzz.contact_day +3
            AND NOT EXISTS ( SELECT * -- eliminate the middle men ...
                    FROM mydata nx
                    WHERE nx.person_id = md.person_id
                    AND nx.contact_day > zzz.contact_day
                    AND nx.contact_day < md.contact_day
            )
    )
SELECT * FROM zzz
ORDER BY person_id, contact_day;

There will probably be a shorter solution using window functions.

The result:


 person_id | contact_day | days_last_contact | dash_group 
-----------+-------------+-------------------+------------
         1 | 2015-02-09  |                   |          1
         1 | 2015-05-01  |                81 |          2
         1 | 2015-05-02  |                 1 |          2
         1 | 2015-05-03  |                 1 |          2
         1 | 2015-06-01  |                29 |          3
         1 | 2015-08-01  |                61 |          4
         1 | 2015-08-04  |                 3 |          4
         1 | 2015-09-01  |                28 |          5
         2 | 2015-05-01  |                   |          1
         2 | 2015-06-01  |                31 |          2
         2 | 2015-07-01  |                30 |          3
         3 | 2015-05-01  |                   |          1
         3 | 2015-05-02  |                 1 |          1
         3 | 2015-05-04  |                 2 |          1
(14 rows)

If I understand correctly, we can use a conditional SUM window function.

If we create a suitable index on the person_id and contact_day columns of the mydata table, we might get better performance for this query.
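The suggested composite index might look like this (illustrated via Python's sqlite3 so it can be run as-is; the index name is made up, and the same CREATE INDEX statement works in PostgreSQL). An index on (person_id, contact_day) matches the window function's PARTITION BY / ORDER BY, so rows can be read in the required order without an extra sort:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mydata (person_id INTEGER, contact_day TEXT);
INSERT INTO mydata VALUES (1,'2015-02-09'),(1,'2015-05-01'),(2,'2015-05-01');

-- Composite index matching PARTITION BY person_id ORDER BY contact_day
-- (the index name is made up):
CREATE INDEX mydata_person_day_idx ON mydata (person_id, contact_day);
""")

# The ordered scan the window function needs now comes straight from
# the covering index instead of a table scan plus sort:
plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT person_id, contact_day FROM mydata
ORDER BY person_id, contact_day
""").fetchall()
print(plan[0][3])
```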

Query #1

So the query might be as below:

SELECT 
  person_id, 
  contact_day, 
  days_last_contact,
  SUM(CASE WHEN days_last_contact <= 3 THEN 0 ELSE 1 END) OVER(PARTITION BY person_id ORDER BY contact_day) 
FROM mydata
ORDER BY person_id, contact_day
;

If days_last_contact needs to be calculated, we can calculate it in a subquery:

SELECT 
  person_id, 
  contact_day, 
  days_last_contact,
  SUM(CASE WHEN days_last_contact <= 3 THEN 0 ELSE 1 END) OVER(PARTITION BY person_id ORDER BY contact_day) 
FROM (
    SELECT person_id,
           contact_day,
           contact_day - lag(contact_day) 
        OVER (PARTITION BY person_id ORDER BY contact_day ASC) 
        AS days_last_contact
    FROM mydata
) t1
ORDER BY person_id, contact_day
;
 person_id | contact_day              | days_last_contact | sum
-----------+--------------------------+-------------------+-----
         1 | 2015-02-09T00:00:00.000Z |                   |   1
         1 | 2015-05-01T00:00:00.000Z |                81 |   2
         1 | 2015-05-02T00:00:00.000Z |                 1 |   2
         1 | 2015-05-03T00:00:00.000Z |                 1 |   2
         1 | 2015-06-01T00:00:00.000Z |                29 |   3
         1 | 2015-08-01T00:00:00.000Z |                61 |   4
         1 | 2015-08-04T00:00:00.000Z |                 3 |   4
         1 | 2015-09-01T00:00:00.000Z |                28 |   5
         2 | 2015-05-01T00:00:00.000Z |                   |   1
         2 | 2015-06-01T00:00:00.000Z |                31 |   2
         2 | 2015-07-01T00:00:00.000Z |                30 |   3
         3 | 2015-05-01T00:00:00.000Z |                   |   1
         3 | 2015-05-02T00:00:00.000Z |                 1 |   1
         3 | 2015-05-04T00:00:00.000Z |                 2 |   1
         3 | 2015-06-01T00:00:00.000Z |                28 |   2
         3 | 2015-06-02T00:00:00.000Z |                 1 |   2
         3 | 2015-06-06T00:00:00.000Z |                 4 |   3

View on DB Fiddle
