9

I have a table in the following format

Id  StartDate   EndDate Type
1   2012-02-18  2012-03-18  1
1   2012-03-17  2012-06-29  1
1   2012-06-27  2012-09-27  1
1   2014-08-23  2014-09-24  3
1   2014-09-23  2014-10-24  3
1   2014-10-23  2014-11-24  3
2   2015-07-04  2015-08-06  1
2   2015-08-04  2015-09-06  1
3   2013-11-01  2013-12-01  0
3   2018-01-09  2018-02-09  0

I found similar questions here, but not something that could help me solve my problem. I want to merge rows that has the same Id, Type and overlapping date periods.

The result from the above table should be

Id  StartDate   EndDate Type
1   2012-02-18  2012-09-27  1
1   2014-08-23  2014-11-24  3
2   2015-07-04  2015-09-06  1
3   2013-11-01  2013-12-01  0
3   2018-01-09  2018-02-09  0

In another server, I was able to do it with the following restrictions and the query below:

  • Didn't care about the Type column, but just the Id
  • Had a newer version of SQL Server (2012), but now I have 2008 which the code is not compatible

SELECT Id
     , MIN(StartDate) AS StartDate
     , MAX(EndDate) AS EndDate
FROM (
    SELECT *
         , SUM(CASE WHEN a.EndDate = a.StartDate THEN 0
                    ELSE 1
               END
           ) OVER (ORDER BY Id, StartDate) sm
    FROM (
        SELECT Id
             , StartDate
             , EndDate
             , LAG(EndDate, 1, NULL) OVER (PARTITION BY Id ORDER BY Id, EndDate) EndDate
        FROM #temptable
    ) a
) b
GROUP BY Id, sm

Any advice how I can

  • Include Type on the process
  • Make it work on SQL Server 2008
Vladimir Baranov
  • 31,799
  • 5
  • 53
  • 90
Tasos
  • 7,325
  • 18
  • 83
  • 176
  • what logic is used for `Id=3`..? – Pugal May 07 '19 at 09:03
  • So sorry. Was a mistake while was writing the question. I fixed it – Tasos May 07 '19 at 09:05
  • All your rows have 14 days interval. If less then 14 days interval, how the OP looks like..? For eg: `1 2014-10-23 2014-10-24 3` – Pugal May 07 '19 at 09:10
  • It will just take the min and max of the dates. Just as the 14 days was an actual part of the date period. – Tasos May 07 '19 at 09:11
  • I think this is similar to what my problem was. I had the same columns as you but I continued to create some additional helper columns for 'contiguous range', 'new_period', 'new_period_starting_id' etc. from my two starting date interval columns, using `ROW_NUMBER()` with `PARTITION BY`. Here's [my approach and solution](https://stackoverflow.com/questions/38399316/query-to-identify-contiguous-ranges). Unfortunately I don't have the full code anymore, but this was my approach. – Radu Gheorghiu May 07 '19 at 09:20
  • Sample data is best served as [DDL](https://en.wikipedia.org/wiki/Data_definition_language) + [DML](https://en.wikipedia.org/wiki/Data_manipulation_language). Please [edit] your question to include it, your current attempt and your desired results. For more details, [read this.](https://dba.meta.stackexchange.com/questions/2976/help-me-write-this-query-in-sql) – Zohar Peled May 07 '19 at 09:28
  • Is the enddate inclusive or exclusive? I mean if end date = 2019-01-10 and startdate for next row is 2019-10-11 is this a gap or contiguous? – Salman A Aug 27 '19 at 09:20
  • @SalmanA should be a gap – Tasos Aug 27 '19 at 10:10

5 Answers5

7

This approach uses an additional temp table to identify the groups of overlapping dates, and then performs a quick aggregate based on the groupings.

SELECT *, ROW_NUMBER() OVER (ORDER BY Id, Type) AS UID,
    ROW_NUMBER() OVER (ORDER BY Id, Type) AS GroupId INTO #G FROM #TempTable
WHILE @@ROWCOUNT <> 0 BEGIN
    UPDATE T1 SET
        GroupId = T2.GroupId
    FROM #G T1
        INNER JOIN (
            SELECT T1.UID, CASE WHEN T1.GroupId  < T2.GroupId THEN T1.GroupId ELSE T2.GroupId END
            FROM #G T1
                LEFT OUTER JOIN #G T2
                    ON T1.Id = T2.Id AND T1.Type = T2.Type AND T1.GroupId <> T2.GroupId
                        AND T1.StartDate <= T2.EndDate AND T2.StartDate <= T1.EndDate
            ) T2 (UID, GroupId)
            ON T1.UID = T2.UID
    WHERE T1.GroupId <> T2.GroupId
END
SELECT Id, MIN(StartDate) AS StartDate, MAX(EndDate) AS EndDate, Type
FROM #G G GROUP BY GroupId, Id, Type

This returns the expected values

Id          StartDate  EndDate    Type
----------- ---------- ---------- -----------
1           2012-02-18 2012-09-27 1
1           2014-08-23 2014-11-24 3
2           2015-07-04 2015-09-06 1
3           2013-11-01 2013-12-01 0
3           2018-01-09 2018-02-09 0
Jason W
  • 13,026
  • 3
  • 31
  • 62
  • It's working. I will keep the bounty open in case there is an answer with better performance, but I chose yours as the best to get the bounty if noone else come around. Thank you a lot – Tasos Aug 21 '19 at 17:45
  • 1
    Great, glad it's working for you. Hope you find best approach. For SQL 2008, I expect you'll see 2 basic solutions - (1) some indefinite iterative grouping approach like this or (2) a recursive CTE, such as from Jordan. Either is correct, and either may be faster based on your source data and complexity of the overlapping dates. – Jason W Aug 21 '19 at 19:31
  • 1
    For 10 rows, I found the CTE approach about 2x faster. For 1 million rows with same level of complexity (about 2-5 overlaps), I found the iterative approach about 3x faster than CTE. Just be mindful if you use CTE, you may have to set MAXRECURSION option if there are more than 1000 overlaps in your source data. – Jason W Aug 21 '19 at 19:33
  • @JasonW, This looks like a [Packing Intervals](http://blogs.solidq.com/en/sqlserver/packing-intervals/) problem. Itzik Ben-Gan presents a nice solution. – Vladimir Baranov Aug 25 '19 at 14:05
  • Very nice - I've seen similar approaches to this type of problem with Islands and Gaps (https://www.red-gate.com/simple-talk/sql/t-sql-programming/the-sql-of-gaps-and-islands-in-sequences/) as well. – Jason W Aug 26 '19 at 00:57
2

This is 2008 compatible. A CTE really is the best way to link up all overlapping records in my opinion. The date overlap logic came from this thread: SO Date Overlap

I added extra data that's more complex to make sure that it's working as expected.

DECLARE @Data table (Id INT, StartDate DATE, EndDate DATE, Type INT)
INSERT INTO @data 

SELECT 1,'2/18/2012' ,'3/18/2012', 1 UNION ALL
select 1,'3/17/2012','6/29/2012',1 UNION ALL
select 1,'6/27/2012','9/27/2012',1 UNION ALL
select 1,'8/23/2014','9/24/2014',3 UNION ALL
select 1,'9/23/2014','10/24/2014',3 UNION ALL
select 1,'10/23/2014','11/24/2014',3 UNION ALL
select 2,'7/4/2015','8/6/2015',1 UNION ALL
select 2,'8/4/2015','9/6/2015',1 UNION ALL
select 3,'11/1/2013','12/1/2013',0 UNION ALL
select 3,'1/9/2018','2/9/2018',0 UNION ALL 
select 4,'1/1/2018','1/2/2018',0 UNION ALL --many non overlapping dates
select 4,'1/4/2018','1/5/2018',0 UNION ALL
select 4,'1/7/2018','1/9/2018',0 UNION ALL
select 4,'1/11/2018','1/13/2018',0 UNION ALL

select 4,'2/7/2018','2/8/2018',0 UNION ALL --many overlapping dates
select 4,'2/8/2018','2/9/2018',0 UNION ALL 
select 4,'2/9/2018','2/10/2018',0 UNION all
select 4,'2/10/2018','2/11/2018',0 UNION all
select 4,'2/11/2018','2/12/2018',0 UNION all
select 4,'2/12/2018','2/13/2018',0 UNION all

select 4,'3/7/2018','3/8/2018',0 UNION ALL --many overlapping dates, second instance of id 4, type 0
select 4,'3/8/2018','3/9/2018',0 UNION ALL 
select 4,'3/9/2018','3/10/2018',0 UNION all
select 4,'3/10/2018','3/11/2018',0 UNION all
select 4,'3/11/2018','3/12/2018',0 UNION all
select 4,'3/12/2018','3/13/2018',0 



;
WITH cdata
AS (SELECT  Id,
        d.Type,
        d.StartDate,
        d.EndDate,
        CurrentStart = d.StartDate
    FROM    @Data d
    WHERE
        NOT EXISTS (
            SELECT * FROM @Data x WHERE x.StartDate < d.StartDate AND d.StartDate <= x.EndDate AND d.EndDate >= x.StartDate AND d.Id = x.Id AND d.Type = x.Type --get first records for overlapping ranges

                )       
    UNION ALL
    SELECT  d.Id,
        d.Type,
        StartDate = CASE WHEN d2.StartDate < d.StartDate THEN d2.StartDate ELSE d.StartDate END,
        EndDate = CASE WHEN d2.EndDate > d.EndDate THEN d2.EndDate ELSE d.EndDate END,
        CurrentStart = d2.StartDate
    FROM    cdata d
        INNER JOIN @Data d2
            ON (
                d.StartDate <= d2.EndDate
                AND d.EndDate >= d2.StartDate
            ) 
            AND d2.Id = d.Id
            AND d2.Type = d.Type
            AND d2.StartDate > d.CurrentStart)
SELECT cdata.Id, cdata.Type, cdata.StartDate, EndDate = MAX(cdata.EndDate) 
FROM        cdata 
GROUP BY cdata.Id, cdata.Type, cdata.StartDate
Jordan Ryder
  • 2,336
  • 1
  • 24
  • 29
1

This looks like a Packing Intervals problem. See the post by Itzik Ben-Gan for all the details and what indexes he recommends to make it work efficiently. He presents a solution without recursive CTE.

Two notes.

  1. The query below assumes that intervals are [closed; open), i.e. StartDate is inclusive and EndDate is exclusive. This way to represent such data is often the most convenient. (in the same sense as having arrays as zero-based instead of 1-based is usually more convenient in programming languages).

  2. I added a RowID column to have unambiguous sorting.

Sample data

DECLARE @T TABLE
(
    RowID int IDENTITY,
    id int,
    StartDate date,
    EndDate date,
    tp int
);

INSERT INTO @T(Id, StartDate, EndDate, tp) VALUES
(1,  '2012-02-18',  '2012-03-18',  1),
(1,  '2012-03-17',  '2012-06-29',  1),
(1,  '2012-06-27',  '2012-09-27',  1),
(1,  '2014-08-23',  '2014-09-24',  3),
(1,  '2014-09-23',  '2014-10-24',  3),
(1,  '2014-10-23',  '2014-11-24',  3),
(2,  '2015-07-04',  '2015-08-06',  1),
(2,  '2015-08-04',  '2015-09-06',  1),
(3,  '2013-11-01',  '2013-12-01',  0),
(3,  '2018-01-09',  '2018-02-09',  0);

-- Make EndDate an opened interval, make it exclusive
-- [Start; End)
UPDATE @T
SET EndDate = DATEADD(day, 1, EndDate)
;

Recommended indexes

-- indexes to support solutions
CREATE UNIQUE INDEX idx_start_id ON T(id, tp, StartDate, RowID);
CREATE UNIQUE INDEX idx_end_id ON T(id, tp, EndDate, RowID);

Query

Read the Itzik's post to understand what is going on. He has nice illustrations there. In short, each timestamp (start or end) is treated as an event. Each event has a + or - type. Each time we encounter a + event (some interval starts) we increase the running counter. Each time we encounter a - event (some interval ends) we decrease the running counter. When the running counter is 0 it means that the streak of overlapping intervals is over.

I took Itzik's query as is and simply changed the column names to match your names.

WITH C1 AS
-- let e = end ordinals, let s = start ordinals
(
    SELECT
        RowID, id, tp, StartDate AS ts, +1 AS EventType,
        NULL AS e,
        ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY StartDate, RowID) AS s
    FROM @T

    UNION ALL

    SELECT
        RowID, id, tp, EndDate AS ts, -1 AS EventType,
        ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY EndDate, RowID) AS e,
        NULL AS s
    FROM @T
),
C2 AS
-- let se = start or end ordinal, namely, how many events (start or end) happened so far
(
    SELECT C1.*,
    ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY ts, EventType DESC, RowID) AS se
    FROM C1
),
C3 AS
-- For start events, the expression s - (se - s) - 1 represents how many sessions were active
-- just before the current (hence - 1)
--
-- For end events, the expression (se - e) - e represents how many sessions are active
-- right after this one
--
-- The above two expressions are 0 exactly when a group of packed intervals
-- either starts or ends, respectively
--
-- After filtering only events when a group of packed intervals either starts or ends,
-- group each pair of adjacent start/end events
(
    SELECT id, tp, ts,
        ((ROW_NUMBER() OVER(PARTITION BY id, tp ORDER BY ts) - 1) / 2 + 1)
        AS grpnum
    FROM C2
    WHERE COALESCE(s - (se - s) - 1, (se - e) - e) = 0
)
SELECT id, tp, MIN(ts) AS StartDate, DATEADD(day, -1, MAX(ts)) AS EndDate
FROM C3
GROUP BY id, tp, grpnum
ORDER BY id, tp, StartDate;

Result

+----+----+------------+------------+
| id | tp | StartDate  |  EndDate   |
+----+----+------------+------------+
|  1 |  1 | 2012-02-18 | 2012-09-27 |
|  1 |  3 | 2014-08-23 | 2014-11-24 |
|  2 |  1 | 2015-07-04 | 2015-09-06 |
|  3 |  0 | 2013-11-01 | 2013-12-01 |
|  3 |  0 | 2018-01-09 | 2018-02-09 |
+----+----+------------+------------+
Vladimir Baranov
  • 31,799
  • 5
  • 53
  • 90
0
create table #table
(Id int,StartDate date,  EndDate date, Type int)

insert into #table
values

('1','2012-02-18','2012-03-18','1'),('1','2012-03-19','2012-06-19','1'),
('1','2012-06-27','2012-09-27','1'),('1','2014-08-23','2014-09-24','3'),
('1','2014-09-23','2014-10-24','3'),('1','2014-10-23','2014-11-24','3'),
('2','2015-07-04','2015-08-06','1'),('2','2015-08-04','2015-09-06','1'),
('3','2013-11-01','2013-12-01','0'),('3','2018-01-09','2018-02-09','0')

select ID,MIN(startdate)sd,MAX(EndDate)ed,type from #table
group by ID,TYPE,YEAR(startdate),YEAR(EndDate)
Dr.Stark
  • 116
  • 1
  • 4
0

this can be easily achieved by using some window-functions and CTE's. Here is the solution

DECLARE @table TABLE
(id        INT, 
 StartDate DATE, 
 EndDate   DATE, 
 [Type]    INT
);

INSERT INTO @table(Id,  StartDate,  EndDate,  [Type]) VALUES
(1,  '2012-02-18',  '2012-03-18',  1),
(1,  '2012-03-17',  '2012-06-29',  1),
(1,  '2012-06-27',  '2012-09-27',  1),
(1,  '2014-08-23',  '2014-09-24',  3),
(1,  '2014-09-23',  '2014-10-24',  3),
(1,  '2014-10-23',  '2014-11-24',  3),
(2,  '2015-07-04',  '2015-08-06',  1),
(2,  '2015-08-04',  '2015-09-06',  1),
(3,  '2013-11-01',  '2013-12-01',  0),
(3,  '2018-01-09',  '2018-02-09',  0);

WITH C1 AS
(
  SELECT *,
    MAX(EndDate) OVER(PARTITION BY Id, [Type]
          ORDER BY StartDate, EndDate
          ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS PrevEnd
  FROM @table
),
C2 AS
(
  SELECT *,
    SUM(StartFlag) OVER(PARTITION BY Id, [Type]
          ORDER BY StartDate, EndDate
          ROWS UNBOUNDED PRECEDING) AS GroupID
  FROM C1
    CROSS APPLY ( VALUES(CASE WHEN StartDate <= PrevEnd THEN NULL ELSE 1 END) ) AS A(StartFlag)
)
SELECT Id, [Type], MIN(StartDate) AS StartDate, MAX(EndDate) AS EndDate
FROM C2
GROUP BY Id, [Type], GroupID;
  • 2
    I receive this error here `Incorrect syntax near 'ROWS'`. Are you sure it is compatible with SQL Server 2008? – Tasos Aug 21 '19 at 17:13