5

I'm writing an app that handles scheduling time off for some of our employees. As part of this, I need to calculate how many minutes throughout the day that they have requested off.

In the first version of this tool, we disallowed overlapping time off requests, because we wanted to be able to just add up the total of StartTime minus EndTime for all requests. Preventing overlaps makes this calculation very fast.

This has become problematic, because Managers now want to schedule team meetings but are unable to do so when someone has already asked for the day off.

So, in the new version of the tool, we have a requirement to allow overlapping requests.

Here is an example set of data like what we have:

UserId | StartDate | EndDate
----------------------------
 1     | 2:00      | 4:00
 1     | 3:00      | 5:00
 1     | 3:45      | 9:00
 2     | 6:00      | 9:00
 2     | 7:00      | 8:00
 3     | 2:00      | 3:00
 3     | 4:00      | 5:00
 4     | 1:00      | 7:00

The result that I need to get, as efficiently as possible, is this:

UserId | StartDate | EndDate
----------------------------
 1     | 2:00      | 9:00
 2     | 6:00      | 9:00
 3     | 2:00      | 3:00
 3     | 4:00      | 5:00
 4     | 1:00      | 7:00

We can easily detect overlaps with this query:

select
    *
from
    requests r1
cross join
    requests r2
where
    r1.RequestId < r2.RequestId
  and
    r1.StartTime < r2.EndTime
  and
    r2.StartTime < r1.EndTime

This is, in fact, how we were detecting and preventing the problems originally.

Now, we are trying to merge the overlapping items, but I'm reaching the limits of my SQL ninja skills.

It wouldn't be too hard to come up with a method using temp tables, but we want to avoid this if at all possible.

Is there a set-based way to merge overlapping rows?


Edit:

It would also be acceptable for the all of the rows to show up, as long as they were collapsed into just their time. For example if someone wants off from three to five, and from four to six, it would be acceptable for them to have two rows, one from three to five, and the next from five to six OR one from three to four, and the next from four to six.

Also, here is a little test bench:

DECLARE @requests TABLE
(
    UserId int,
    StartDate time,
    EndDate time
)

INSERT INTO @requests (UserId, StartDate, EndDate) VALUES
(1, '2:00', '4:00'),
(1, '3:00', '5:00'),
(1, '3:45', '9:00'),
(2, '6:00', '9:00'),
(2, '7:00', '8:00'),
(3, '2:00', '3:00'),
(3, '4:00', '5:00'),
(4, '1:00', '7:00');
John Gietzen
  • 48,783
  • 32
  • 145
  • 190

3 Answers3

4

Complete Rewrite:

;WITH new_grp AS (
   SELECT r1.UserId, r1.StartTime
   FROM   @requests r1
   WHERE  NOT EXISTS (
          SELECT *
          FROM   @requests r2
          WHERE  r1.UserId = r2.UserId
          AND    r2.StartTime <  r1.StartTime
          AND    r2.EndTime   >= r1.StartTime)
   GROUP  BY r1.UserId, r1.StartTime -- there can be > 1
   ),r AS (
   SELECT r.RequestId, r.UserId, r.StartTime, r.EndTime
         ,count(*) AS grp -- guaranteed to be 1+
   FROM   @requests r
   JOIN   new_grp n ON n.UserId = r.UserId AND n.StartTime <= r.StartTime
   GROUP  BY r.RequestId, r.UserId, r.StartTime, r.EndTime
   )
SELECT min(RequestId) AS RequestId
      ,UserId
      ,min(StartTime) AS StartTime
      ,max(EndTime)   AS EndTime
FROM   r
GROUP  BY UserId, grp
ORDER  BY UserId, grp

Now produces the requested result and really covers all possible cases, including disjunct sub-groups and duplicates. Have a look at the comments to the test data in the working demo at data.SE.

  • CTE 1
    Find the (unique!) points in time where a new group of overlapping intervals starts.

  • CTE 2
    Count the starts of new group up to (and including) every individual interval, thereby forming a unique group number per user.

  • Final SELECT
    Merge the groups, take earlies start and latest end for groups.

I faced some difficulty, because T-SQL window functions max() or sum() do not accept an ORDER BY clause in a in a window. They can only compute one value per partition, which makes it impossible to compute a running sum / count per partition. Would work in PostgreSQL or Oracle (but not in MySQL, of course - it has neither window functions nor CTEs).

The final solution uses one extra CTE and should be just as fast.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • @JohnGietzen: Your question, you be the judge of that. And I agree, as my last version was incorrect for special cases and was not optimized. I rewrote it completely, it should *really* cover all possible cases now. I also agree, that recursive CTE are a very appealing concept, and your post is very instructve. However, it is **incorrect** in its current form. I'll comment on your answer. – Erwin Brandstetter Dec 03 '11 at 18:34
4

Ok, it is possible to do with CTEs. I did not know how to use them at the beginning of the night, but here is the results of my research:

A recursive CTE has 2 parts, the "anchor" statement and the "recursive" statements.

The crucial part about the recursive statement is that when it is evaluated, only the rows that have not already been evaluated will show up in the recursion.

So, for example, if we wanted to use CTEs to get an all-inclusive list of times for these users, we could use something like this:

WITH
  sorted_requests as (
    SELECT
        UserId, StartDate, EndDate,
        ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY StartDate, EndDate DESC) Instance
    FROM @requests
  ),
  no_overlap(UserId, StartDate, EndDate, Instance) as (
    SELECT *
    FROM sorted_requests
    WHERE Instance = 1

    UNION ALL

    SELECT s.*
    FROM sorted_requests s
    INNER JOIN no_overlap n
    ON s.UserId = n.UserId
    AND s.Instance = n.Instance + 1
  )
SELECT *
FROM no_overlap

Here, the "anchor" statement is just the first instance for every user, WHERE Instance = 1.

The "recursive" statement joins each row to the next row in the set, using the s.UserId = n.UserId AND s.Instance = n.Instance + 1

Now, we can use the property of the data, when sorted by start date, that any overlapping row will have a start date that is less than the previous row's end date. If we continually propagate the row number of the first intersecting row, every subsequent overlapping row will share that row number.

Using this query:

WITH
  sorted_requests as (
    SELECT
        UserId, StartDate, EndDate,
        ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY StartDate, EndDate DESC) Instance
    FROM
        @requests
  ),
  no_overlap(UserId, StartDate, EndDate, Instance, ConnectedGroup) as (
    SELECT
        UserId,
        StartDate,
        EndDate,
        Instance,
        Instance as ConnectedGroup
    FROM sorted_requests
    WHERE Instance = 1

    UNION ALL

    SELECT
        s.UserId,
        s.StartDate,
        CASE WHEN n.EndDate >= s.EndDate
            THEN n.EndDate
            ELSE s.EndDate
        END EndDate,
        s.Instance,
        CASE WHEN n.EndDate >= s.StartDate
            THEN n.ConnectedGroup
            ELSE s.Instance
        END ConnectedGroup
    FROM sorted_requests s
    INNER JOIN no_overlap n
    ON s.UserId = n.UserId AND s.Instance = n.Instance + 1
  )
SELECT
    UserId,
    MIN(StartDate) StartDate,
    MAX(EndDate) EndDate
FROM no_overlap
GROUP BY UserId, ConnectedGroup
ORDER BY UserId

We group by the aforementioned "first intersecting row" (called ConnectedGroup in this query) and find the minimum start time and maximum end time in that group.

The first intersecting row is propagated using this statement:

CASE WHEN n.EndDate >= s.StartDate
    THEN n.ConnectedGroup
    ELSE s.Instance
END ConnectedGroup

Which basically says, "if this row intersects with the previous row (based on us being sorted by start date), then consider this row to have the same 'row grouping' as the previous row. Otherwise, use this row's own row number as the 'row grouping' for itself."

This gives us exactly what we were looking for.

EDIT

When I had originally thought this up on my whiteboard, I knew that I would have to advance the EndDate of each row, to ensure that it would intersect with the next row, if any of the previous rows in the connected group would have intersected. I accidentally left that out. This has been corrected.

John Gietzen
  • 48,783
  • 32
  • 145
  • 190
  • +1 for a very educational post about recursive CTE. However, the result is incorrect in that it assumes that the start of a new group of overlapping intervals could be determined from the preceding row alone. If that was the case, it would be a much simpler problem and my first approach would have sufficed. You can copy your query to [my demo on data.SE](http://data.stackexchange.com/stackoverflow/qe/2219/merge-overlapping-intervals-using-cte-but-no-window-functions) to test. It demonstrates the false results. (I adjusted my test case with identical column names, so it can work as a drop-in.) – Erwin Brandstetter Dec 03 '11 at 18:39
  • @ErwinBrandstetter, I was worried about the problem with determining overlap with the previous row alone, which is why I have sorted both by the start date and the end date. Do you have an example of a case that would cause my query to deliver incorrect results? – John Gietzen Dec 03 '11 at 20:08
  • @ErwinBrandstetter: You are correct. I had thought about this before and had a fix, but I left it out. Let me edit my version to fix it. – John Gietzen Dec 03 '11 at 20:11
  • For an example of a previously failing test case: User #5, from 3:00 to 8:00, from 4:00 to 5:00 and from 7:00 to 9:00. The new version of my query fixes this. – John Gietzen Dec 03 '11 at 20:38
1

This works for postgres. Microsoft might need some modifications.

SET search_path='tmp';

DROP TABLE tmp.schedule CASCADE;

CREATE TABLE tmp.schedule
        ( person_id INTEGER NOT NULL
        , dt_from timestamp with time zone
        , dt_to timestamp with time zone
        );
INSERT INTO schedule( person_id, dt_from, dt_to) VALUES
          ( 1, '2011-12-03 02:00:00' , '2011-12-03 04:00:00' )
        , ( 1, '2011-12-03 03:00:00' , '2011-12-03 05:00:00' )
        , ( 1, '2011-12-03 03:45:00' , '2011-12-03 09:00:00' )
        , ( 2, '2011-12-03 06:00:00' , '2011-12-03 09:00:00' )
        , ( 2, '2011-12-03 07:00:00' , '2011-12-03 08:00:00' )
        , ( 3, '2011-12-03 02:00:00' , '2011-12-03 03:00:00' )
        , ( 3, '2011-12-03 04:00:00' , '2011-12-03 05:00:00' )
        , ( 4, '2011-12-03 01:00:00' , '2011-12-03 07:00:00' );

ALTER TABLE schedule ADD PRIMARY KEY (person_id,dt_from)
        ;
CREATE UNIQUE INDEX ON schedule (person_id,dt_to);

SELECT * FROM schedule ORDER BY person_id, dt_from;

WITH RECURSIVE ztree AS (
    -- Terminal part
    SELECT p1.person_id AS person_id
       , p1.dt_from AS dt_from
       , p1.dt_to AS dt_to
    FROM schedule p1
    UNION
    -- Recursive part
    SELECT p2.person_id AS person_id
       , LEAST(p2.dt_from, zzt.dt_from) AS dt_from
       , GREATEST(p2.dt_to, zzt.dt_to) AS dt_to
    FROM ztree AS zzt
       , schedule AS p2
    WHERE 1=1
    AND p2.person_id = zzt.person_id
    AND (p2.dt_from < zzt.dt_from AND p2.dt_to >= zzt.dt_from)
    )
SELECT *
FROM ztree zt
WHERE NOT EXISTS (
   SELECT * FROM ztree nx
   WHERE nx.person_id = zt.person_id
           -- the recursive query returns *all possible combinations of
           -- touching or overlapping intervals
           -- we'll have to filter, keeping only the biggest ones
           -- (the ones for which there is no bigger overlapping interval)
   AND     ( (nx.dt_from <= zt.dt_from AND nx.dt_to > zt.dt_to)
          OR (nx.dt_from < zt.dt_from AND nx.dt_to >= zt.dt_to)
          )
      )
ORDER BY zt.person_id,zt.dt_from
    ;

Result:

DROP TABLE
CREATE TABLE
INSERT 0 8
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "schedule_pkey"  for table "schedule"
ALTER TABLE
CREATE INDEX
 person_id |        dt_from         |         dt_to          
-----------+------------------------+------------------------
         1 | 2011-12-03 02:00:00+01 | 2011-12-03 04:00:00+01
         1 | 2011-12-03 03:00:00+01 | 2011-12-03 05:00:00+01
         1 | 2011-12-03 03:45:00+01 | 2011-12-03 09:00:00+01
         2 | 2011-12-03 06:00:00+01 | 2011-12-03 09:00:00+01
         2 | 2011-12-03 07:00:00+01 | 2011-12-03 08:00:00+01
         3 | 2011-12-03 02:00:00+01 | 2011-12-03 03:00:00+01
         3 | 2011-12-03 04:00:00+01 | 2011-12-03 05:00:00+01
         4 | 2011-12-03 01:00:00+01 | 2011-12-03 07:00:00+01
(8 rows)

 person_id |        dt_from         |         dt_to          
-----------+------------------------+------------------------
         1 | 2011-12-03 02:00:00+01 | 2011-12-03 09:00:00+01
         2 | 2011-12-03 06:00:00+01 | 2011-12-03 09:00:00+01
         3 | 2011-12-03 02:00:00+01 | 2011-12-03 03:00:00+01
         3 | 2011-12-03 04:00:00+01 | 2011-12-03 05:00:00+01
         4 | 2011-12-03 01:00:00+01 | 2011-12-03 07:00:00+01
(5 rows)
wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • I don't understand why the terminal part of the CTE doesn't just return all of the rows in the table. – John Gietzen Dec 03 '11 at 20:20
  • It does, but they are filtered out (even worse: the sub-assemblies are also produced by the CTE). See the comment above the AND( OR AND OR) condition in the final query. There might be a performance-penalty for non-postgres DBMSs, postgres's optimiser/plan generator is very aggressive, shuffling tables or terms in or out of subqueries, CTE's are hanldled extremely well, and perform better than naive self-joins in the cases where "adjacent" keys are to be stitched together, like in this case. – wildplasser Dec 03 '11 at 20:31