Complex 'Gaps and Islands' issue

Question

I have a table in a Postgres DB like this:

person    |     eventdate     |  type 
--------------------------------------
<uuid-1>  |   2016-05-14      |   300
<uuid-3>  |   2016-05-14      |   300
<uuid-1>  |   2016-05-15      |   301
<uuid-1>  |   2016-05-16      |   301
<uuid-1>  |   2016-05-18      |   304
<uuid-1>  |   2016-05-22      |   300
<uuid-2>  |   2016-05-22      |   304
<uuid-2>  |   2016-05-27      |   301
<uuid-1>  |   2016-05-30      |   300
<uuid-1>  |   2016-06-01      |   300
<uuid-2>  |   2016-06-15      |   501
<uuid-2>  |   2016-06-16      |   301
<uuid-4>  |   2016-06-16      |   300
<uuid-5>  |   2016-06-20      |   300
<uuid-1>  |   2016-06-21      |   300
<uuid-2>  |   2016-06-21      |   300
<uuid-2>  |   2016-06-23      |   301
<uuid-2>  |   2016-06-30      |   300
<uuid-3>  |   2016-06-30      |   300
<uuid-4>  |   2016-06-30      |   300

The table contains non consecutive day entries for each day an employee is absence for different reasons (types) of absence. However an absence period could span several of these days and any absence entry that is within 5 days of a previous absence of the same type is still considered part of the same absence 'period'.

I need to get output for each employees absences periods with start and end date of those periods, plus the total number of days within that multi-date-spanned period.

This is further complicated by the fact that different types of absence are considered the same for the purpose of this report. So in the example above, type 300, 301, 304 would be treated as the same.

So from my example above the following would be what I am after ...

person    |     startdate     |       enddate     |  days   |  type 
--------------------------------------------------------------------
<uuid-1>  |     2016-05-14    |     2016-05-22    |   5     |   300
<uuid-3>  |     2016-05-14    |     2016-04-14    |   1     |   300
<uuid-2>  |     2016-05-22    |     2016-04-27    |   2     |   304
<uuid-1>  |     2016-05-30    |     2016-06-01    |   2     |   300
<uuid-2>  |     2016-06-15    |     2016-06-15    |   1     |   501
<uuid-2>  |     2016-06-16    |     2016-06-16    |   1     |   301
<uuid-4>  |     2016-06-16    |     2016-06-16    |   1     |   300
<uuid-5>  |     2016-06-20    |     2016-06-20    |   1     |   300
<uuid-1>  |     2016-06-21    |     2016-06-21    |   1     |   300
<uuid-2>  |     2016-06-21    |     2016-06-23    |   2     |   300
<uuid-2>  |     2016-06-30    |     2016-06-30    |   1     |   300
<uuid-3>  |     2016-06-30    |     2016-06-30    |   1     |   300
<uuid-4>  |     2016-06-30    |     2016-06-30    |   1     |   300

How do I query this table to this output?

The result doesn't add up. There is no type `503` in the example. And it's unclear how you determine the `type` for each period. Postgres version and table definition showing all data types and constraints should be there, too. — Erwin Brandstetter, Aug 17 '16 at 00:33
@ErwinBrandstetter I am using the first value for the type, but the lowest will also work. The type output is not so important as making sure that the right types are blocked together. — Ben Drury, Aug 17 '16 at 07:45
My answer is blocking *all* types together. Not sure, maybe you wanted to keep 501 separate? — Erwin Brandstetter, Aug 17 '16 at 13:34

score 2 · Accepted Answer · edited May 23 '17 at 10:30

It's unclear how you determine the type for each period. I chose the minimum number.

Assuming this basic table definition:

CREATE TABLE tbl (person text, eventdate date, type int);

Basically, I suggest window functions in two nested subqueries to identify members of the same period (island). Then aggregate:

SELECT person, period
     , min(eventdate) AS startdate
     , max(eventdate) AS enddate
     , count(*)       AS days
     , min(type)      AS type
FROM  (
   SELECT person, eventdate, type
        , count(gap) OVER (PARTITION BY person ORDER BY eventdate) AS period
   FROM  (
      SELECT person, eventdate, type
           , CASE WHEN lag(eventdate) OVER (PARTITION BY person ORDER BY eventdate)
                     > eventdate - 6  -- within 5 days
                  THEN NULL           -- same period
                  ELSE TRUE           -- next period
             END AS gap
      FROM   tbl
      ) sub
   ) sub
GROUP  BY person, period
ORDER  BY person, period;

Result (based on your example data):

  person  | period | startdate  |  enddate   | days | type
----------+--------+------------+------------+------+------
 <uuid-1> |      1 | 2016-05-14 | 2016-05-22 |    5 |  300
 <uuid-1> |      2 | 2016-05-30 | 2016-06-01 |    2 |  300
 <uuid-1> |      3 | 2016-06-21 | 2016-06-21 |    1 |  300
 <uuid-2> |      1 | 2016-05-22 | 2016-05-27 |    2 |  301
 <uuid-2> |      2 | 2016-06-15 | 2016-06-23 |    4 |  300
 <uuid-2> |      3 | 2016-06-30 | 2016-06-30 |    1 |  300
 <uuid-3> |      1 | 2016-05-14 | 2016-05-14 |    1 |  300
 <uuid-3> |      2 | 2016-06-30 | 2016-06-30 |    1 |  300
 <uuid-4> |      1 | 2016-06-16 | 2016-06-16 |    1 |  300
 <uuid-4> |      2 | 2016-06-30 | 2016-06-30 |    1 |  300
 <uuid-5> |      1 | 2016-06-20 | 2016-06-20 |    1 |  300

If the same day for the same person can be entered multiple times with different types, and you only want to count distinct days, make it: count(DISTINCT eventdate) AS days.

Related, with detailed explanation:

BTW, eventdate - 6 works for data type date, but not for timestamp:

How do I determine the last day of the previous month using PostgreSQL?

Thanks Erwin. I've been trying to solve this for about two weeks now. I will plug this into the real data and see what happens. — Ben Drury, Aug 17 '16 at 07:43
That's the one. Thanks Erwin. Appreciate it. (Still not sure I understand fully what the query is doing by it works in situ with my data.) — Ben Drury, Aug 17 '16 at 08:42
@BenDrury: Have a look at the linked answers for more explanation. — Erwin Brandstetter, Aug 17 '16 at 12:56

Complex 'Gaps and Islands' issue

1 Answers1

Linked