
Is there a better way of merging overlapping date intervals?
The solution I came up with is so simple that now I wonder if someone else has a better idea of how this could be done.

/***** DATA EXAMPLE *****/
DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)
INSERT INTO @T (d1, d2)
        SELECT '2010-01-01','2010-03-31' UNION SELECT '2010-04-01','2010-05-31' 
  UNION SELECT '2010-06-15','2010-06-25' UNION SELECT '2010-06-26','2010-07-10' 
  UNION SELECT '2010-08-01','2010-08-05' UNION SELECT '2010-08-01','2010-08-09' 
  UNION SELECT '2010-08-02','2010-08-07' UNION SELECT '2010-08-08','2010-08-08' 
  UNION SELECT '2010-08-09','2010-08-12' UNION SELECT '2010-07-04','2010-08-16' 
  UNION SELECT '2010-11-01','2010-12-31' UNION SELECT '2010-03-01','2010-06-13' 

/***** INTERVAL ANALYSIS *****/
WHILE (1=1)  BEGIN
  UPDATE t1 SET t1.d2 = t2.d2
  FROM @T AS t1 INNER JOIN @T AS t2 ON 
            DATEADD(day, 1, t1.d2) BETWEEN t2.d1 AND t2.d2 
  IF @@ROWCOUNT = 0 BREAK
END

/***** RESULT *****/
SELECT StartDate = MIN(d1) , EndDate = d2
FROM @T
GROUP BY d2
ORDER BY StartDate, EndDate

/***** OUTPUT *****/
/*****
StartDate   EndDate
2010-01-01  2010-06-13 
2010-06-15  2010-08-16 
2010-11-01  2010-12-31 
*****/
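For readers who want to check the logic outside SQL Server, here is a rough Python sketch of what the loop above does (`merge_iterative` is an invented name; the original is T-SQL, and the grouping step mirrors the final `GROUP BY d2`):

```python
from datetime import date, timedelta

ONE = timedelta(days=1)

def merge_iterative(intervals):
    """Repeat the UPDATE: whenever the day after one interval's end falls
    inside another interval, adopt that interval's (later) end. Stop when
    a full pass changes nothing, then group by final end date."""
    ivals = [list(p) for p in intervals]
    while True:                               # WHILE (1=1) ... BREAK
        changed = False
        for a in ivals:
            for b in ivals:
                if b[0] <= a[1] + ONE <= b[1]:
                    a[1] = b[1]               # b's end is strictly later here
                    changed = True
        if not changed:
            break                             # IF @@ROWCOUNT = 0 BREAK
    merged = {}                               # GROUP BY d2: earliest start per end
    for s, e in ivals:
        merged[e] = min(merged.get(e, s), s)
    return sorted((s, e) for e, s in merged.items())
```

Run against the sample data above, it produces the same three merged ranges as the T-SQL.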
gbn
leoinfo
  • Are the intervals open-open, closed-closed, open-closed or closed-open? It matters because the end conditions vary slightly depending on the convention. For many purposes, closed-open (including the first date, excluding the second) is the best representation; closed-closed (both ends included) is often what people have in mind. – Jonathan Leffler Apr 01 '10 at 14:56
  • Jonathan, I was thinking about cases when both (start date and end date) days are part of the period. – leoinfo Apr 01 '10 at 15:09
  • It is possible to do it single-pass, but it's a cursor implementation so it depends on the size of the dataset. – Lasse V. Karlsen May 02 '10 at 11:22
  • @Lasse: Do you have an example? I could avoid cursor usage... – leoinfo May 03 '10 at 14:54
  • @leoinfo this is somehow related question - http://stackoverflow.com/questions/18618999/group-all-related-records-in-many-to-many-relationship-sql-graph-connected-comp/18663943#18663943, I've checked different approach (and cursor too) and iterative update was the fastest one, so I think your good solution is still the best one. – Roman Pekar Sep 23 '13 at 12:14

10 Answers


I was looking for the same solution and came across this post on Combine overlapping datetime to return single overlapping range record.

There is another thread on Packing Date Intervals.

I tested this with various date ranges, including the ones listed here, and it works correctly every time.


SELECT 
       s1.StartDate,
       --t1.EndDate 
       MIN(t1.EndDate) AS EndDate
FROM @T s1 
INNER JOIN @T t1 ON s1.StartDate <= t1.EndDate
  AND NOT EXISTS(SELECT * FROM @T t2 
                 WHERE t1.EndDate >= t2.StartDate AND t1.EndDate < t2.EndDate) 
WHERE NOT EXISTS(SELECT * FROM @T s2 
                 WHERE s1.StartDate > s2.StartDate AND s1.StartDate <= s2.EndDate) 
GROUP BY s1.StartDate 
ORDER BY s1.StartDate 

The result is:

StartDate  | EndDate
2010-01-01 | 2010-06-13
2010-06-15 | 2010-06-25
2010-06-26 | 2010-08-16
2010-11-01 | 2010-12-31
Jonathan Leffler
user1045402
  • Also, found another example with explanation on how to achieve this here: http://www.sqlmag.com/blog/puzzled-by-t-sql-blog-15/tsql/packing-date-intervals-136831 – user1045402 Nov 23 '11 at 11:50
  • 1
    You can edit your own answer to add more information, just click on the "edit" link at the bottom of your answer. – ForceMagic Nov 06 '12 at 23:15
  • 4
    Note that the http://www.sqlmag.com/blog/puzzled-by-t-sql-blog-15/tsql/packing-date-intervals-136831 URL is 404 as of 2017-12-19. The WayBack Machine gives a historical reference: https://web.archive.org/web/20130403002905/http://www.sqlmag.com/blog/puzzled-by-t-sql-blog-15/tsql/packing-date-intervals-136831. – Jonathan Leffler Dec 20 '17 at 06:40
  • 1
    Why has that so many upvotes? For the ranges (2018-03-26 08:40:00.000, 2018-03-26 17:09:00.000), (2018-03-27 09:26:00.000, 2018-03-27 13:36:00.000), (2018-03-26 13:00:00.000, 2018-03-27 09:26:00.000) and (2018-03-26 08:00:00.000, 2018-03-26 08:40:00.000), it returns (2018-03-26 08:00:00.000, 2018-03-26 08:40:00.000), which is completely wrong. – Anytoe Mar 21 '18 at 09:42
  • @Anytoe - no it doesn't? It returns the same as my one adjusted for this case. http://rextester.com/ELEMK23524. A single range from `2018-03-26 08:00:00.000` to `2018-03-27 13:36:00.000` – Martin Smith Apr 14 '18 at 15:38
  • Unfortunately I can't get this to work when correlating with additional fields. – PeterX Feb 22 '21 at 07:11

You asked this back in 2010 but don't specify any particular version.

An answer for people on SQL Server 2012+

WITH T1
     AS (SELECT *,
                MAX(d2) OVER (ORDER BY d1) AS max_d2_so_far
         FROM   @T),
     T2
     AS (SELECT *,
                CASE
                  WHEN d1 <= DATEADD(DAY, 1, LAG(max_d2_so_far) OVER (ORDER BY d1))
                    THEN 0
                  ELSE 1
                END AS range_start
         FROM   T1),
     T3
     AS (SELECT *,
                SUM(range_start) OVER (ORDER BY d1) AS range_group
         FROM   T2)
SELECT range_group,
       MIN(d1) AS d1,
       MAX(d2) AS d2
FROM   T3
GROUP  BY range_group 

Which returns

+-------------+------------+------------+
| range_group |     d1     |     d2     |
+-------------+------------+------------+
|           1 | 2010-01-01 | 2010-06-13 |
|           2 | 2010-06-15 | 2010-08-16 |
|           3 | 2010-11-01 | 2010-12-31 |
+-------------+------------+------------+

`DATEADD(DAY, 1, ...)` is used because your desired results show you want a period ending on 2010-06-25 to be collapsed into one starting 2010-06-26. For other use cases this may need adjusting.
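The window-function logic above collapses into a single ordered pass. This is a hypothetical Python sketch of the same gaps-and-islands idea (`merge_running_max` is an invented name, not part of the original T-SQL):

```python
from datetime import date, timedelta

def merge_running_max(intervals):
    """One pass in d1 order: an interval starts a new group (range_start = 1)
    exactly when its start is more than one day past the running maximum end
    seen so far (max_d2_so_far); otherwise it extends the current group."""
    out = []
    for s, e in sorted(intervals):
        if out and s <= out[-1][1] + timedelta(days=1):
            out[-1][1] = max(out[-1][1], e)   # same range_group
        else:
            out.append([s, e])                # new range_group
    return [tuple(p) for p in out]
```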

Martin Smith

Here is a solution with just three simple scans: no CTEs, no recursion, no joins, no table updates in a loop, no GROUP BY. As a result, this solution should scale the best, I think. The number of scans can be reduced to two if the min and max dates are known in advance; the core logic needs just two scans: find the gaps, applied twice.

declare @datefrom datetime, @datethru datetime

DECLARE @T TABLE (d1 DATETIME, d2 DATETIME)

INSERT INTO @T (d1, d2)

SELECT '2010-01-01','2010-03-31' 
UNION SELECT '2010-03-01','2010-06-13' 
UNION SELECT '2010-04-01','2010-05-31' 
UNION SELECT '2010-06-15','2010-06-25' 
UNION SELECT '2010-06-26','2010-07-10' 
UNION SELECT '2010-08-01','2010-08-05' 
UNION SELECT '2010-08-01','2010-08-09' 
UNION SELECT '2010-08-02','2010-08-07' 
UNION SELECT '2010-08-08','2010-08-08' 
UNION SELECT '2010-08-09','2010-08-12' 
UNION SELECT '2010-07-04','2010-08-16' 
UNION SELECT '2010-11-01','2010-12-31' 

select @datefrom = min(d1) - 1, @datethru = max(d2) + 1 from @t

SELECT 
StartDate, EndDate
FROM
(
    SELECT 
    MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
    LEAD(StartDate ) OVER (ORDER BY StartDate) - 1 EndDate
    FROM
    (
        SELECT 
        StartDate, EndDate
        FROM
        (
            SELECT 
            MAX(EndDate) OVER (ORDER BY StartDate) + 1 StartDate,
            LEAD(StartDate) OVER (ORDER BY StartDate) - 1 EndDate 
            FROM 
            (
                SELECT d1 StartDate, d2 EndDate from @T 
                UNION ALL 
                SELECT @datefrom StartDate, @datefrom EndDate 
                UNION ALL 
                SELECT @datethru StartDate, @datethru EndDate
            ) T
        ) T
        WHERE StartDate <= EndDate
        UNION ALL 
        SELECT @datefrom StartDate, @datefrom EndDate 
        UNION ALL 
        SELECT @datethru StartDate, @datethru EndDate
    ) T
) T
WHERE StartDate <= EndDate

The result is:

StartDate   EndDate
2010-01-01  2010-06-13
2010-06-15  2010-08-16
2010-11-01  2010-12-31
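The "find gaps, applied twice" idea can be checked outside SQL. This is a hypothetical Python rendering (function names invented) of the two logical passes, including the sentinel rows added one day below and above the data:

```python
from datetime import date, timedelta

ONE = timedelta(days=1)

def gaps(rows):
    """One 'find the gaps' level of the query: gap start = running
    MAX(EndDate) + 1, gap end = next row's StartDate - 1, keeping only
    rows where start <= end."""
    rows = sorted(rows)
    out, run_max = [], rows[0][1]
    for i in range(len(rows) - 1):
        run_max = max(run_max, rows[i][1])
        s, e = run_max + ONE, rows[i + 1][0] - ONE
        if s <= e:
            out.append((s, e))
    return out

def merge_via_double_gaps(intervals):
    lo = min(s for s, _ in intervals) - ONE   # sentinel below all data
    hi = max(e for _, e in intervals) + ONE   # sentinel above all data
    sentinels = [(lo, lo), (hi, hi)]
    g = gaps(intervals + sentinels)           # first pass: the gaps
    return gaps(g + sentinels)                # second pass: gaps of the gaps
```

The gaps of the gaps, bounded by the sentinels, are exactly the merged intervals.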
Oleg K
  • I'm puzzled by your claim of 'three simple scans'. I see 5 main levels of SELECT, simply counting down the indentation. Presumably, the three-way UNION queries don't involve extra scans for the extra two entries. I'm also a little puzzled by the use of the table alias `T` four times. However, since you don't actually qualify any column names with the alias, it probably doesn't matter. – Jonathan Leffler Nov 08 '18 at 01:57
  • Five SELECTs do not mean that SQL will scan the table five times. The three scans were meant as logical ones: first to find the max dates, second to find the gaps, and third to find the gaps once again. Window functions are not allowed in the WHERE clause; that's why sub-SELECTs are used. Anyway, the query gives correct results, is logically simpler, and is also less expensive because no joins, GROUP BY, or CTEs are used, just basic table scans. – Oleg K Nov 08 '18 at 15:18
  • Also, I just compared resource usage of my query and the query which uses CTEs and GROUP BY. My solution uses noticeably lower cpu_time, logical_reads and writes; I query sys.dm_exec_sessions for that. I added some more rows to table @T and compared both solutions. My solution: cpu_time=180, logical_reads=72,000, writes=142. Solution with CTEs: cpu_time=312, logical_reads=104,353, writes=211. – Oleg K Nov 08 '18 at 15:25

The idea is to simulate the scanning algorithm for merging intervals. My solution works across a wide range of SQL implementations; I've tested it on MySQL, Postgres, SQL Server 2017, SQLite and even Hive.

Assuming the table schema is the following.

CREATE TABLE t (
  a DATETIME,
  b DATETIME
);

We also assume the interval is half-open like [a,b).

When (a,i,j) is in the table, it shows that there are j intervals covering a, and there are i intervals covering the previous point.

CREATE VIEW r AS 
SELECT a,
       Sum(d) OVER (ORDER BY a ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS i,
       Sum(d) OVER (ORDER BY a ROWS UNBOUNDED PRECEDING) AS j
FROM  (SELECT a, Sum(d) AS d
       FROM   (SELECT a,  1 AS d FROM t
               UNION ALL
               SELECT b, -1 AS d FROM t) e
       GROUP  BY a) f;

We produce all the endpoints in the union of the intervals and pair up adjacent ones. Finally, we produce the set of intervals by only picking the odd-numbered rows.

SELECT a, b
FROM (SELECT a,
             Lead(a)      OVER (ORDER BY a) AS b,
             Row_number() OVER (ORDER BY a) AS n
      FROM   r
      WHERE  j=0 OR i=0 OR i is null) e
WHERE  n%2 = 1;

I've created a sample DB-fiddle and SQL-fiddle. I also wrote a blog post on union intervals in SQL.
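For reference, the scanning algorithm itself can be sketched in a few lines of Python (an illustration with invented names, not the SQL above). It works on half-open intervals with any comparable endpoint type:

```python
def merge_sweep(intervals):
    """Sweep over half-open intervals [a, b): +1 at each start, -1 at each
    end; a merged interval opens when coverage rises from 0 and closes when
    it returns to 0 (the rows where i = 0 or j = 0 in the view)."""
    delta = {}
    for a, b in intervals:
        delta[a] = delta.get(a, 0) + 1
        delta[b] = delta.get(b, 0) - 1
    out, cover, start = [], 0, None
    for x in sorted(delta):
        before, cover = cover, cover + delta[x]
        if before == 0 and cover > 0:
            start = x                 # union interval begins
        elif before > 0 and cover == 0:
            out.append((start, x))    # union interval ends
    return out
```

Note that, as in the answer, touching half-open intervals like [1, 2) and [2, 3) merge naturally because the +1 and -1 events at the shared endpoint cancel.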

Chao Xu

A Geometric Approach

Here and elsewhere I've noticed that date packing questions don't provide a geometric approach to this problem. After all, any range, date ranges included, can be interpreted as a line. So why not convert them to SQL Server's `geometry` type and use geometry::UnionAggregate to merge the ranges?

Why?

This has the advantage of handling all types of overlaps, including fully nested ranges. It also works like any other aggregate query, so it's a little more intuitive in that respect. You also get the bonus of a visual representation of your results if you care to use it. Finally, it is the approach I use for simultaneous range packing (you work with rectangles instead of lines in that case, and there are many more considerations). I just couldn't get the existing approaches to work in that scenario.

This has the disadvantage of requiring more recent versions of SQL Server. It also requires a numbers table and it's annoying to extract the individually produced lines from the aggregated shape. But hopefully in the future Microsoft adds a TVF that allows you to do this easily without a numbers table (or you can just build one yourself). Also, geometrical objects work with floats, so you have conversion annoyances and precision concerns to keep in mind.

Performance-wise I don't know how it compares, but I've done a few things (not shown here) to make it work for me even with large datasets.

Code Description

In 'numbers':

  • I build a table representing a sequence
  • Swap it out with your favorite way to make a numbers table.
  • For a union operation, you won't ever need more rows than in your original table, so I just use it as the base to build it.

In 'mergeLines':

  • I convert the dates to floats and use those floats to create geometrical points.
  • In this problem, we're working in 'integer space', meaning there are no time considerations, so a begin date in one range that is one day after an end date in another should be merged with that other. In order to make that merge happen, we need to convert to 'real space', so we add 1 to the tail of all ranges (we undo this later).
  • I then connect these points via STUnion and STEnvelope.
  • Finally, I merge all these lines via UnionAggregate. The resulting 'lines' geometry object might contain multiple lines, but if they overlap, they turn into one line.

In the outer query:

  • I use the numbers CTE to extract the individual lines inside 'lines'.
  • I envelope each line, which ensures it is stored as just its two endpoints.
  • I read the endpoint x values and convert them back to their time representations, ensuring to put them back into 'integer space'.

The Code

with 

    numbers as (

        select  row_number() over (order by (select null)) i 
        from    @t

    ),

    mergeLines as (

        select      lines = geometry::UnionAggregate(line)
        from        @t
        cross apply (select line = 
                        geometry::Point(convert(float, d1), 0, 0).STUnion(
                            geometry::Point(convert(float, d2) + 1, 0, 0)
                        ).STEnvelope()
                    ) l

    )

    select      ap.StartDate,
                ap.EndDate
    from        mergeLines ml
    join        numbers n on n.i between 1 and ml.lines.STNumGeometries()
    cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
    cross apply (select 
                    StartDate = convert(datetime,l.line.STPointN(1).STX),
                    EndDate = convert(datetime,l.line.STPointN(3).STX) - 1
                ) ap
    order by    ap.StartDate;
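The one-dimensional core of this trick (extend each inclusive end by one day into 'real space', union segments that touch or overlap, then shift the ends back) can be sketched without the geometry type at all. A hypothetical Python illustration, with an invented name:

```python
from datetime import date, timedelta

ONE = timedelta(days=1)

def merge_as_segments(intervals):
    """1-D version of the geometric approach: closed [s, e] date ranges
    become half-open segments [s, e + 1 day), touching segments are
    unioned, and the shift is undone at the end."""
    segs = sorted([s, e + ONE] for s, e in intervals)
    out = []
    for s, e in segs:
        if out and s <= out[-1][1]:           # touching counts as overlap now
            out[-1][1] = max(out[-1][1], e)
        else:
            out.append([s, e])
    return [(s, e - ONE) for s, e in out]
```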
pwilcox
  • This looks the best. But is it possible to use another column (let's call it group) and make the date packing for each of the group's values? – gastan Aug 25 '23 at 07:41
  • @gastan, It's been a while, but yes, re-reading I do believe that if you group by a new column in 'mergelines' and then again in the outer query, it would produce results grouped on that column. The numbers CTE could be left as is, but optionally it can be put into a temp table instead and a different implementation used. – pwilcox Aug 29 '23 at 20:23

In this solution, I created a temporary Calendar table which stores a value for every day across a range. This type of table can be made static. In addition, I originally stored only some 400-odd dates starting with 2009-12-31. Obviously, if your dates span a larger range, you would need more values.

In addition, this solution will only work with SQL Server 2005+ in that I'm using a CTE.

With Calendar As
    (
    Select DateAdd(d, ROW_NUMBER() OVER ( ORDER BY s1.object_id ), '1900-01-01') As [Date]
    From sys.columns as s1
        Cross Join sys.columns as s2
    )
    , StopDates As
    (
    Select C.[Date]
    From Calendar As C
        Left Join @T As T
            On C.[Date] Between T.d1 And T.d2
    Where C.[Date] >= ( Select Min(T2.d1) From @T As T2 )
        And C.[Date] <= ( Select Max(T2.d2) From @T As T2 )
        And T.d1 Is Null
    )
    , StopDatesInUse As
    (
    Select D1.[Date]
    From StopDates As D1
        Left Join StopDates As D2
            On D1.[Date] = DateAdd(d,1,D2.Date)
    Where D2.[Date] Is Null
    )
    , DataWithEarliestStopDate As 
    (
    Select *
    , (Select Min(SD2.[Date])
        From StopDatesInUse As SD2
        Where T.d2 < SD2.[Date] ) As StopDate
    From @T As T
    )
Select Min(d1), Max(d2)
From DataWithEarliestStopDate
Group By StopDate
Order By Min(d1)

EDIT The problem with using dates in 2009 has nothing to do with the final query. The problem is that the Calendar table is not big enough. I started the Calendar table at 2009-12-31. I have revised it to start at 1900-01-01.

Thomas
  • Your code is merging intervals that are not supposed to be merged. Using this initial intervals /**/ SELECT '2009-01-01', '2009-01-01' UNION SELECT '2009-01-03', '2009-01-03' /**/ the code returns a single period : 2009-01-01 to 2009-01-03. In this case 2009-01-02 should not be included in the resulting interval. – leoinfo Apr 07 '10 at 19:07
  • First, you should add the schema and specifically whether D1 = D2. None of your example data suggests that. Second, if you **add** {2010-01-01,2010-01-01}, to your existing example data, the first range should still be 2010-01-01 to 2010-06-13 because the first entry in your example covers 2010-01-01 to 2010-03-31. Third, if you instead **replace** the first entry in your example with {2010-01-01, 2010-01-01},{2010-03-01, 2010-03-01}, the results of my query are still correct. Making that change, the first two entries come out as {2010-01-01, 2010-01-01}, {2010-03-01, 2010-06-13}. – Thomas Apr 07 '10 at 20:40
  • One more scenario, if you replace all entries with only {2010-01-01,2010-01-01},{2010-03-01,2010-03-01}, you will get those same two entries. – Thomas Apr 07 '10 at 20:42
  • I see a typo in my edit but the results are still correct. If you replace the first entry with {2010-01-01,2010-01-01} and {2010-01-03,2010-01-03}, it does come out correctly in that it creates two ranges which exclude 2010-01-02. – Thomas Apr 07 '10 at 20:44
  • @leoinfo - Added this comment to ensure that you are notified. – Thomas Apr 07 '10 at 20:51
  • @leoinfo - My apologies. I notice now that your breaking example is in 2009. The problem is that in my original example, my Calendar table starts 2009-12-31. All that needs happen is to expand the Calendar table. I have adjusted. – Thomas Apr 08 '10 at 03:24
  • @Thomas - Your single-query code returns the right results. I ran some tests though, and the only issue now is that the code is very slow for more (like 500+) records, regardless of using a temp table or a table variable. On my test, merging 500 overlapping intervals took about 42s/32s (Temp/Variable). Oddly, merging 500 NON-overlapping intervals took even longer, 65s/161s. – leoinfo Apr 19 '10 at 16:38
  • @Thomas - With my solution, merging 500 overlapping intervals took about 37s/1s (Temp/Variable). Merging 500 NON-overlapping intervals took 150ms/56ms. – leoinfo Apr 19 '10 at 16:40
  • @Thomas - I wonder if your code could be further optimized, because it really looks better than mine. Unfortunately, in this form, it seems to be too resource-consuming. – leoinfo Apr 19 '10 at 16:42
  • @leoinfo - Certainly one way to help is to make the Calendar table static. I.e., create the Calendar table outside of the query with a clustered PK on the date column. – Thomas Apr 19 '10 at 18:25
  • @leoinfo - Have you tried using a temp table and creating indexes on d1 and d2? I'm going to guess that the most expensive part of the query is the StopDatesInUse CTE section because of DateAdd which it may not be interpreting as SARGable. You might also consider enabling `DATE_CORRELATION_OPTIMIZATION` (see http://technet.microsoft.com/en-us/library/ms177416.aspx) which will help significantly with queries against the Calendar table. – Thomas Apr 19 '10 at 18:55

Try this

;WITH T1 AS
(
    SELECT d1, d2, ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
    FROM @T
), NUMS AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS R
    FROM T1 A
    CROSS JOIN T1 B
    CROSS JOIN T1 C
), ONERANGE AS 
(
    SELECT DISTINCT DATEADD(DAY, ROW_NUMBER() OVER(PARTITION BY T1.R ORDER BY (SELECT 0)) - 1, T1.D1) AS ELEMENT
    FROM T1
    CROSS JOIN NUMS
    WHERE NUMS.R <= DATEDIFF(DAY, d1, d2) + 1
), SEQUENCE AS
(
    SELECT ELEMENT, DATEDIFF(DAY, '19000101', ELEMENT) - ROW_NUMBER() OVER(ORDER BY ELEMENT) AS rownum
    FROM ONERANGE
)
SELECT MIN(ELEMENT) AS StartDate, MAX(ELEMENT) as EndDate
FROM SEQUENCE
GROUP BY rownum

The basic idea is to first unroll the existing data so you get a separate row for each day. This is done in ONERANGE.

Then, identify the relationship between how the dates increment and how the row numbers do. The difference remains constant within an existing range/island. As soon as you reach a new island, the difference between them increases, because the date increments by more than 1 while the row number increments by 1.
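A hypothetical Python sketch of the same unroll-then-difference technique (names invented; the original is T-SQL):

```python
from datetime import date, timedelta

def merge_by_day_islands(intervals):
    """Unroll every range into its distinct covered days (ONERANGE), then
    group consecutive runs: day - row_number is constant within an island
    and jumps at every gap (SEQUENCE)."""
    days = sorted({s + timedelta(days=d)
                   for s, e in intervals
                   for d in range((e - s).days + 1)})
    groups = {}
    for rownum, day in enumerate(days):
        key = day - timedelta(days=rownum)    # the constant per island
        groups.setdefault(key, []).append(day)
    return sorted((g[0], g[-1]) for g in groups.values())
```

As the prose notes, this materializes one element per covered day, so it trades memory for simplicity on long ranges.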

Chris Bednarski

This solution is similar to the first one, with an additional deletion condition. It collapses the data in the main table itself instead of using a different table to store the result.

DROP TABLE IF EXISTS #SampleTable;
CREATE TABLE #SampleTable (StartTime DATETIME NULL, EndTime DATETIME NULL);
INSERT INTO #SampleTable(StartTime, EndTime)
VALUES
    (N'2010-01-01T00:00:00', N'2010-03-31T00:00:00'),
    (N'2010-03-01T00:00:00', N'2010-06-13T00:00:00'),
    (N'2010-04-01T00:00:00', N'2010-05-31T00:00:00'),
    (N'2010-06-15T00:00:00', N'2010-06-25T00:00:00'),
    (N'2010-06-26T00:00:00', N'2010-07-10T00:00:00'),
    (N'2010-07-04T00:00:00', N'2010-08-16T00:00:00'),
    (N'2010-08-01T00:00:00', N'2010-08-05T00:00:00'),
    (N'2010-08-01T00:00:00', N'2010-08-09T00:00:00'),
    (N'2010-08-02T00:00:00', N'2010-08-07T00:00:00'),
    (N'2010-08-08T00:00:00', N'2010-08-08T00:00:00'),
    (N'2010-08-09T00:00:00', N'2010-08-12T00:00:00'),
    (N'2010-11-01T00:00:00', N'2010-12-31T00:00:00');
--
DECLARE @RowCount INT=0;
WHILE(1=1) --
BEGIN
    SET @RowCount=0;
    --
    UPDATE  T1
    SET T1.EndTime=T2.EndTime
    FROM    dbo.#SampleTable AS T1
            INNER JOIN dbo.#SampleTable AS T2 ON  DATEADD(DAY, 1, T1.EndTime) BETWEEN T2.StartTime AND T2.EndTime;
    --
    SET @RowCount=@RowCount+@@ROWCOUNT;
    --
    DELETE  T1
    FROM    dbo.#SampleTable AS T1
            INNER JOIN dbo.#SampleTable AS T2 ON T1.EndTime=T2.EndTime AND T1.StartTime>T2.StartTime;
    --
    SET @RowCount=@RowCount+@@ROWCOUNT;
    --
    IF  @RowCount=0 --
        BREAK;
END;

SELECT * FROM #SampleTable

I was inspired by the Geometric Approach given by pwilcox, but wanted to try a different approach. This uses Trino, but I hope the functions used can also be found in other SQL dialects.

WITH Geo AS (
   SELECT
      transform(                                                        -- 6) See Below~
        ST_Geometries(                                                  -- 5) Extracts an array of individual lines from the union.
          geometry_union(                                               -- 4) Returns the union of aggregated lines, melding all lines together into a single geometric multi-line.
            array_agg(                                                  -- 3) Aggregation function that joins all lines together.
              ST_LineString(                                            -- 2) Makes the pairs of geometric points into lines.
                ARRAY[ST_Point(0, to_unixtime(d1)), ST_Point(0, to_unixtime(d2))] -- 1) Takes unix time start and end values and makes them into an array of geometric points.
              )
            )
          )
        )
      , x -> (ST_YMin(x), ST_Length(x))) AS timestamp_duration          -- 6) From the array of lines, The minimum value and length of each line is extracted.
   FROM @T                                                              -- The minimum value is a timestamp, length is duration.
   WHERE d1 <> d2                                                       -- I had errors any time this was the case.
)
-- 7) Finally, I unnest the produced array and convert the values back into timestamps.
SELECT from_unixtime(timestamp)            AS StartDate
     , from_unixtime(timestamp + duration) AS EndDate
FROM Geo
CROSS JOIN UNNEST(timestamp_duration) AS t(timestamp, duration)

For reference, this took my company cluster about 2 minutes to make 400k start/end timestamps into 700 distinct start/end timestamps. It also runs in just 2 stages.

BeRT2me

Thought I could contribute here, as I've been working on a solution to this that avoids self-joins to improve performance - it was able to group 300,000 records into 100 groups in ~1 second on my local machine. Hope someone finds some use for it.

CREATE TYPE [dbo].[dateOverlapTable] AS TABLE(
    [GROUP_ID] [int] NOT NULL,
    [START_DTTM] [datetime] NOT NULL,
    [END_DTTM] [datetime] NOT NULL
)
GO

DECLARE @startTable AS [dbo].[dateOverlapTable]

INSERT INTO @startTable
--SELECT YOUR DATA HERE

/*
Replace the END_DTTM of each row in the group with the maximum END_DTTM of records that started on the same or an earlier START_DTTM.
This accounts for one record being entirely contained within another, so that succeeding records consider the later end date when checking for overlaps.
*/
;WITH overlapEndDate AS (
    SELECT startTab.GROUP_ID
    , startTab.START_DTTM 

    , MAX(startTab.END_DTTM) OVER (PARTITION BY startTab.GROUP_ID ORDER BY startTab.START_DTTM rows between unbounded preceding and CURRENT ROW) [END_DTTM]
    FROM
    @startTable startTab
)

/*
Adds a [flag] to overlapEndDate:
1: If, when ordered by START_DTTM and partitioned over GROUP_ID, overlaps with previous overlapEndDate record
0: otherwise
*/
, overlapFlag AS (
    SELECT *
    , CASE
        WHEN LAG(END_DTTM) OVER (PARTITION BY GROUP_ID ORDER BY START_DTTM ASC) >= START_DTTM
            THEN 1
        ELSE 
            0
    END AS [flag]
    FROM
    overlapEndDate
)



/*
Add ROW_NUMBER as [grp] to [flag] = 0 records from overlapFlag
*/
, initialGroupings AS (
    SELECT *
    , CASE
        WHEN [flag] = 0
            THEN ROW_NUMBER() OVER (ORDER BY START_DTTM ASC)
        ELSE 
            NULL
    END AS [grp]
    FROM overlapFlag
)


/*
Now populate rows where grp is NULL. Populated with the previous grp value where flag = 0.
*/
, overlapGroupings AS (
    SELECT initGrp.GROUP_ID
    , initGrp.START_DTTM
    , initGrp.END_DTTM
    , CASE
        WHEN [flag] = 1
            THEN 
                MAX(initGrp.grp) OVER (PARTITION BY initGrp.GROUP_ID ORDER BY initGrp.START_DTTM ASC rows between unbounded preceding and CURRENT ROW)
            
        ELSE 
            initGrp.[grp]
    END AS [grp]
    FROM initialGroupings initGrp
)

    SELECT GROUP_ID
    , MIN(START_DTTM) AS START_DTTM
    , MAX(END_DTTM) AS END_DTTM
    FROM overlapGroupings
    GROUP BY GROUP_ID, grp
    ORDER BY GROUP_ID ASC, START_DTTM ASC
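A hypothetical Python sketch of the same flag-and-group pass may help when porting this logic (names invented). Note that, matching the `>=` comparison in the answer, it merges only strictly overlapping rows, not day-adjacent ones:

```python
def merge_flag_groups(rows):
    """rows: (group_id, start, end) tuples with any comparable start/end.
    Mirrors the CTE chain: running MAX end per group_id, an overlap flag
    against the previous row, a running island key, then MIN/MAX per key."""
    prev = {}            # group_id -> (running max end, current island key)
    counter = 0
    out = {}
    for gid, s, e in sorted(rows):
        last_end, last_key = prev.get(gid, (None, None))
        if last_end is not None:
            e = max(e, last_end)              # running MAX(END_DTTM)
        if last_end is not None and last_end >= s:
            key = last_key                    # flag = 1: overlaps previous row
        else:
            counter += 1                      # flag = 0: start a new island
            key = counter
        prev[gid] = (e, key)
        cur = out.get((gid, key))
        out[(gid, key)] = (min(cur[0], s), max(cur[1], e)) if cur else (s, e)
    return sorted((gid, s, e) for (gid, _), (s, e) in out.items())
```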