Flattening intersecting timespans by user in PostgreSQL

Question

I'm trying to merge overlapping start-end timestamps into single timespan. Similar problem is available here on SO. I would like to merge timestamps separately for every user I have in data.

SQLFiddle

Sample data:

-- drop table if exists app_log;

create table app_log (
  user_id int,
  login_time timestamp,
  logout_time timestamp
);

insert into app_log values
  (1, '2014-01-01 08:00', '2014-01-01 10:00'), /* here we start */
  (1, '2014-01-01 09:10', '2014-01-01 09:59'), /* fully included in previous interval */
  (1, '2014-01-01 10:00', '2014-01-01 10:48'), /* continuing first interval */
  (1, '2014-01-01 10:40', '2014-01-01 10:49'), /* continuing previous interval */
  (1, '2014-01-01 10:55', '2014-01-01 11:00'), /* isolated interval */
  (2, '2014-01-01 09:00', '2014-01-01 11:00'), /* 2nd user is shifted by one hour */
  (2, '2014-01-01 10:10', '2014-01-01 10:59'), /* to simulate overlaps with 1st user */
  (2, '2014-01-01 11:00', '2014-01-01 11:48'), 
  (2, '2014-01-01 11:40', '2014-01-01 11:49'), 
  (2, '2014-01-01 11:55', '2014-01-01 12:00')  
;

Required results:

  used_id  login_time       logout_time
  1        2014-01-01 08:00 2014-01-01 10:49 /* Merging first 4 lines */
  1        2014-01-01 10:55 2014-01-01 11:00 /* 5 th line is isolated */
  2        2014-01-01 09:00 2014-01-01 11:49 /* Merging lines 6-9 */
  2        2014-01-01 11:55 2014-01-01 12:00 /* last line is isolated */

I've tried to use on of the solutions provided in the mentioned question, but it does not return correct answer even for single user:

with recursive

in_data as (select login_time as d1, logout_time as d2 from app_log where user_id = 1)

, dateRanges (ancestorD1, parentD1, d2, iter) as
(
--anchor is first level of collapse
    select
        d1 as ancestorD1,
        d1 as parentD1,
        d2,
        cast(0 as int) as iter
    from in_data

--recurse as long as there is another range to fold in
    union all

    select
        tLeft.ancestorD1,
        tRight.d1 as parentD1,
        tRight.d2,
        iter + 1  as iter
    from dateRanges as tLeft join in_data as tRight
        --join condition is that the t1 row can be consumed by the recursive row
        on tLeft.d2 between tRight.d1 and tRight.d2
            --exclude identical rows
            and not (tLeft.parentD1 = tRight.d1 and tLeft.d2 = tRight.d2)
)
select
    ranges1.*
from dateRanges as ranges1
where not exists (
    select 1
    from dateRanges as ranges2
    where ranges1.ancestorD1 between ranges2.ancestorD1 and ranges2.d2
        and ranges1.d2 between ranges2.ancestorD1 and ranges2.d2
        and ranges2.iter > ranges1.iter
);

Results in:

ancestord1 parentd1 d2 iter
2014-01-01 10:55:00;2014-01-01 10:55:00;2014-01-01 11:00:00;0
2014-01-01 08:00:00;2014-01-01 10:40:00;2014-01-01 10:49:00;2
2014-01-01 09:10:00;2014-01-01 10:40:00;2014-01-01 10:49:00;3

What is wrong with the query above and how can I extend it to get results by users? Is there better solution for this problem in PostgreSQL?

As a side note, [don't use BETWEEN with date/time/timestamp types](https://sqlblog.org/2011/10/19/what-do-between-and-the-devil-have-in-common). The specific risk SQL Server has is still possible (depending on whether your distro of PostgreSQL was compiled with double-backed timestamps), but less likely; however, my personal belief is that `BETWEEN` actually promotes an incorrect view of ranges over non-integral types. That is, think of everything else as _logically_ (if not actually) being a natural number, with a potentially undefined precision. — Clockwork-Muse, Feb 21 '14 at 12:50

score 2 · Accepted Answer · answered Feb 21 '14 at 09:24

I found this example of how to make a 'range aggregate' using windowing functions and a lot of nested subqueries. I just adapted it to partition and group by user_id, and it seems to do what you want:

SELECT user_id, min(login_time) as login_time, max(logout_time) as logout_time
FROM (
    SELECT user_id, login_time, logout_time,
        max(new_start) OVER (PARTITION BY user_id ORDER BY login_time, logout_time) AS left_edge
    FROM (
        SELECT user_id, login_time, logout_time,
            CASE 
                WHEN login_time <= max(lag_logout_time) OVER (
                    PARTITION BY user_id ORDER BY login_time, logout_time 
                ) THEN NULL 
                ELSE login_time 
            END AS new_start
        FROM (
            SELECT 
                user_id, 
                login_time, 
                logout_time,
                lag(logout_time) OVER (PARTITION BY user_id ORDER BY login_time, logout_time) AS lag_logout_time
            FROM app_log
        ) AS s1
    ) AS s2
) AS s3
GROUP BY user_id, left_edge
ORDER BY user_id, min(login_time)

Results in:

 user_id |     login_time      |     logout_time
---------+---------------------+---------------------
       1 | 2014-01-01 08:00:00 | 2014-01-01 10:49:00
       1 | 2014-01-01 10:55:00 | 2014-01-01 11:00:00
       2 | 2014-01-01 09:00:00 | 2014-01-01 11:49:00
       2 | 2014-01-01 11:55:00 | 2014-01-01 12:00:00
(4 rows)

It works by first detecting the beginning of each new range (partitioned by user_id), then extending and grouping by the detected ranges. I found I had to read that article very carefully to understand it!

The article suggests it can be simplified with Postgresql>=9.0 by removing the innermost subquery and changing the window range, but I could not get that to work.

Looks great, need to spend some time to actually underststand it. I think innermost subquery can be replaced with `ROWS` clause like this: `SELECT user_id, login_time, logout_time, CASE WHEN login_time <= max(logout_time) OVER (PARTITION BY user_id ORDER BY login_time, logout_time rows between unbounded preceding and 1 preceding) THEN NULL ELSE login_time END AS new_start FROM app_log`. Maybe you can add this into your answer so we have simplest statement possible. — Tomas Greif, Feb 21 '14 at 11:42

Flattening intersecting timespans by user in PostgreSQL

1 Answers1