-2

I have segmented linear data to identify the minimum and maximum begin and end values of each range, now I want to consolidate overlapping ranges. This is easy if the data is easy. For example, if the data is always increasing for both the begin and end values. (see the first 4 rows in my sample input below) But where the data is not as uniform, I'm having trouble computing the begin and end of a group of overlapping rows by relating data sets.

It's possible I'm headed down a rabbit hole because of an early decision about how to structure my inputs. The source data doesn't contain the rangebegin and rangeend columns I'm using for inputs. This was my choice to try to help me analyze the data. There may be other options.

Since the volume of information below may be an XY problem, here are the basics of what I'm trying to solve.

In my case, I'm trying to identify one-mile segments of roadway where there have been at least 5 crashes, then combine overlapping ranges to draw as a single geometric feature on a map. So...

Given this data:
Crash A at mile 2.22
Crash B at mile 2.24
Crash C at mile 3.10
Crash D at mile 3.92
Crash E at mile 5.03
Crash F at mile 5.21

I would have a segment from 2.22 to 3.92 because there is less than a mile between each crash in that range. Then there's a gap of more than a mile, then the last two crashes are within a mile of each other. So the output would look like

begin end
2.22 3.92
5.03 5.21

Because of some other variables, the way I am combining the data thus far has produced some ranges that are not sequential. Some of the ranges are included in other ranges. That seems to be causing my biggest problems.

I'm using SQL Server (mostly 2016), but the concept should hold for any database system.

--- Update ---

I am rewriting the question to refer to a "known good" solution as a starting point.

SO moderators: If rewriting this question after receiving answers is wrong, let's fix it.

tl;dr: I'm developing a query that will likely ultimately be written as a series of common table expressions to serve as a data set for a reporting system. I am using a reporting system, not a database management tool like SSMS, so I can't run a "batch". I also don't believe a temporary stored procedure is appropriate, nor do I want to create objects (stored procedures, functions,...) on the database server to solve this single reporting problem. I do realize that the nature of the problem may dictate that I do these things, but I am searching for a query-only solution first.

On the surface, this problem appears to be a classic gaps-and-islands problem, for which I have been using a similar solution for years. But this one involves not only overlapping ranges, but also ranges that don't sort well and so don't work with the classic solution. Starting with https://bertwagner.com/posts/gaps-and-islands/, as recommended by Josh presents a problem. Other overlapping gaps-and-islands solutions I have seen appear to be the same, and so have the same problem. Replacing the last line of inputs in Bert Wagner's solution with...

SELECT '11/2/2017', '11/4/2017' UNION ALL
SELECT '10/27/2017', '11/15/2017' UNION ALL
SELECT '11/5/2017', '11/10/2017'

...helps demonstrate my problem.

Bert's solution is written as a deeply-nested set of subqueries. To improve readability, I would do this as a series of common table expressions.

In an attempt to demonstrate the problem and keep each step in my attempt simple and reproducible, I have broken this down to a series of temporary tables, each with its own outputs. These temporary tables equate to the common table expressions.

DROP TABLE IF EXISTS #OverlappingDateRanges;
CREATE TABLE #OverlappingDateRanges (StartDate date, EndDate date);

INSERT INTO #OverlappingDateRanges
SELECT '8/24/2017', '9/23/2017'  UNION ALL
SELECT '8/24/2017', '9/20/2017'  UNION ALL 
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL 
SELECT '10/25/2017','11/3/2017'  UNION ALL
SELECT '11/2/2017', '11/4/2017' UNION ALL
SELECT '10/27/2017', '11/15/2017' UNION ALL
SELECT '11/5/2017', '11/10/2017'

SELECT * FROM #OverlappingDateRanges;
StartDate  EndDate
---------- ----------
2017-08-24 2017-09-23
2017-08-24 2017-09-20
2017-09-23 2017-09-27
2017-09-25 2017-10-10
2017-10-17 2017-10-18
2017-10-25 2017-11-03
2017-11-02 2017-11-04
2017-10-27 2017-11-15
2017-11-05 2017-11-10
DROP TABLE IF EXISTS #Groups;
SELECT ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN
, StartDate
, EndDate
, LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
INTO #Groups
FROM #OverlappingDateRanges;

SELECT *
FROM #Groups;
RN     StartDate  EndDate    PreviousEndDate
------ ---------- ---------- ---------------
1      2017-08-24 2017-09-20 NULL
2      2017-08-24 2017-09-23 2017-09-20
3      2017-09-23 2017-09-27 2017-09-23
4      2017-09-25 2017-10-10 2017-09-27
5      2017-10-17 2017-10-18 2017-10-10
6      2017-10-25 2017-11-03 2017-10-18
7      2017-10-27 2017-11-15 2017-11-03
8      2017-11-02 2017-11-04 2017-11-15
9      2017-11-05 2017-11-10 2017-11-04    <-- Notice PreviousEndDate < StartDate
DROP TABLE IF EXISTS #Islands;
SELECT *
, CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd
, SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY RN) AS IslandId
INTO #Islands
FROM #Groups g;

SELECT *
FROM #Islands;
RN     StartDate  EndDate    PreviousEndDate IslandStartInd IslandId
------ ---------- ---------- --------------- -------------- -----------
1      2017-08-24 2017-09-20 NULL            1              1
2      2017-08-24 2017-09-23 2017-09-20      0              1
3      2017-09-23 2017-09-27 2017-09-23      0              1
4      2017-09-25 2017-10-10 2017-09-27      0              1
5      2017-10-17 2017-10-18 2017-10-10      1              2
6      2017-10-25 2017-11-03 2017-10-18      1              3
7      2017-10-27 2017-11-15 2017-11-03      0              3
8      2017-11-02 2017-11-04 2017-11-15      0              3
9      2017-11-05 2017-11-10 2017-11-04      1              4        <-- This is within IslandId = 3
SELECT MIN(StartDate) AS IslandStartDate
, MAX(EndDate) AS IslandEndDate
FROM #Islands
GROUP BY IslandId
ORDER BY IslandStartDate;
IslandStartDate IslandEndDate
--------------- -------------
2017-08-24      2017-10-10
2017-10-17      2017-10-18
2017-10-25      2017-11-15
2017-11-05      2017-11-10   <-- This is not a distinct island.

As you can see, the 4th island should be included in the third.

And reviewing Bert's example again in detail, and rewriting it (I'm a tactile learner) has given me some other thoughts regarding how I might solve this. I am posting this here first to be fair to others who may have spent time reading and working on this and care about SO points.

dougp
  • 2,810
  • 1
  • 8
  • 31
  • 2
    I dont have time to write the answer for this, but I have run into this before and the website that made it click for me is: https://bertwagner.com/posts/gaps-and-islands/ I hope this helps and I hope you get a proper answer... but in-case noone hits the problem here is where I went – Josh May 12 '22 at 20:51
  • 1
    That's basically what I'm already doing. I think that works because the values sort neatly. If I replace the inputs to match mine - `dateadd(d, rangebegin, {d '2021-12-31'})`, for example - the results don't work. https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=88b39269424ee7e0efe665c4628dcf77 – dougp May 12 '22 at 23:29
  • Relating to the two answers received so far: "I want to use relational data sets and avoid structured programming involving looping through the data a row at a time." I can absolutely write the looping logic. I'm trying to avoid it. It may be that using a temporary stored procedure is an option, but if I were the DBA I wouldn't give that much power to the SQL login that is used for this report. – dougp May 13 '22 at 19:28
  • @dougp my answer doesn't use a stored proc nor does it process the data `'one row at a time'`; the code consists of a single *batch* of SQL that (re)uses a single #temp table (as opposed to creating 4 #temp tables); there shouldn't be any permissions issues (aka extra 'power') required to run a SQL *batch* – markp-fuso May 13 '22 at 19:58
  • "I'm trying to produce a report" using a reporting system (Power BI, Cognos, etc.) So if I bundle your code into a single thing, that's a view? So I can run `SELECT * FROM [MARKP-FUSO ROUTINE]`? A view containing a `WHILE` loop would be new to my SQL experience. Or would this be a temporary stored procedure that produces a single data set as an output? So the reporting user would need to be able to create and run a temporary stored procedure? – dougp May 16 '22 at 15:47
  • well, if your reporting system allows you to create 4 temp tables (via `create table` and multiple `select/into` queries) then said reporting system is allowing you to submit (multiple?) SQL batches so it should allow you to submit the proposed answer/batch, yes? have you tried to use the provided answer and if so what happens? – markp-fuso May 16 '22 at 15:54
  • It doesn't. I had to boil down the code to something that was relatively easy to explain outside of the reporting system. The first temp table would actually be a query from the database. The rest would be CTEs. The code from the answer you provided will not work in the reporting system. – dougp May 16 '22 at 15:59
  • If the reporting system is allowed to send temporary stored procedures, it seems Joe User can write anything, including dynamic SQL, that could cause considerable havoc and insert it into an ad hoc report. That seems bad. – dougp May 16 '22 at 16:16
  • who said anything about 'temporary stored procedures'? your latest comments suggest your entire question (multiple queries, multiple temp tables) is now void; you should be stating upfront the environment under which you're working, eg, the reporting system(s) you're using and what now sounds like some sort of limitation of only being able to submit a single, standalone query – markp-fuso May 16 '22 at 16:35
  • A temporary stored procedure would be the only way to implement your solution given my environmental constraints. I did state the constraint -- that I want relational data sets, not looping, to solve the problem. I'm not sure why the details behind that are important. Nevertheless, I have updated my question. – dougp May 16 '22 at 22:13

3 Answers3

0

I'm not sure how to put everything in one query but using adhoc/Stored procedure can solve this problem.

I have tested in my Sybase ASE, please try same approach with your SQL/Oracle.

Hope this help.

1. Add column SEQ (auto increment id) into source table

SEQ RANGE
1 2.22
2 2.24
3 3.10
4 3.92
5 5.03
6 5.21

2. Looping and calculate difference between 2 records

 CREATE TABLE TEST_RESULT
(
    BEGIN_MILE NUMERIC(10,2),
    END_MILE NUMERIC(10,2)
)

--Begin ad-hoc code
DECLARE @iLoop numeric(10), @recNo numeric(10), @sMsg VARCHAR(100), @tRANGE numeric(10,2), @rangebegin numeric(10,2), @rangeend numeric(10,2)

SET @recNo = (SELECT COUNT(*) FROM LLE_TEST_SEQ)
SET @iLoop = 3

--Set 1st begin
SET @rangebegin = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = 1)
--Set 1st end
SET @rangeend = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = 2)

WHILE (@iLoop <= @recNo)
    BEGIN
        SET @tRANGE = (SELECT RANGE FROM LLE_TEST_SEQ WHERE SEQ = @iLoop)
        
        --Display for testing
        PRINT '*****'
        SET @sMsg = CONVERT(VARCHAR, @rangeend)
        PRINT @sMsg     
        SET @sMsg = CONVERT(VARCHAR, @tRANGE)
        PRINT @sMsg
        PRINT '*****'
        
        --If delta < 1, increase end point
        IF @tRANGE - @rangeend < 1
            BEGIN
                SET @rangeend = @tRANGE
            END
        --Otherwise, break, insert and create new set
        ELSE
            BEGIN
                INSERT INTO TEST_RESULT (BEGIN_MILE, END_MILE) VALUES (@rangebegin, @rangeend)
                SET @rangebegin = @tRANGE
                SET @rangeend = @tRANGE
            END     
        
        
        SET @iLoop = @iLoop + 1
        
    END
INSERT INTO TEST_RESULT (BEGIN_MILE, END_MILE) VALUES (@rangebegin, @rangeend)
--End ad-hoc code

Output

SELECT * FROM TEST_RESULT
BEGIN_MILE END_MILE
2.22 3.92
5.03 5.21
Hana
  • 824
  • 4
  • 14
  • 30
0

Adding a column to #a and adding a few more data rows to address some scenarios not covered in original data set:

drop table if exists #a

create table #a
(pass        int
,category    varchar(5)
,rangebegin  int
,rangeend    int)

insert #a
values 
  (0, 'a', 1, 19)
, (0, 'a', 1, 27)
, (0, 'a', 9, 37)
, (0, 'a', 14, 42)
, (0, 'a', 45, 65)
, (0, 'a', 47, 70)
, (0, 'a', 52, 62)
, (0, 'a', 65, 65)

, (0, 'a', 81, 83)       -- standalone row/range
, (0, 'a', 95, 95)       -- standalone row/range; begin == end

, (0, 'b', 3, 33)
, (0, 'b', 17, 22)
, (0, 'b', 21, 44)

, (0, 'c', 30, 35)       -- no two rows
, (0, 'c', 33, 40)       -- define the
, (0, 'c', 39, 42)       -- entire range

One idea using a loop to iteratively consolidate ranges:

declare @pass           int,
        @prevcount      int,
        @currcount      int

select  @pass = 0,
        @prevcount = (select count(*) from #a where pass = 0),
        @currcount = 0

while   @currcount != @prevcount
begin
        insert  #a
                (pass, category, rangebegin, rangeend)
        select  distinct
                @pass + 1,
                a1.category,
                (select min(a2.rangebegin)
                 from   #a a2
                 where  a2.category     = a1.category
                 and    a2.pass         = @pass
                 and    (       a1.rangebegin   between  a2.rangebegin and a2.rangeend
                         or     a1.rangeend     between  a2.rangebegin and a2.rangeend)),
                (select max(a3.rangeend)
                 from   #a a3
                 where  a3.category     = a1.category
                 and    a3.pass         = @pass
                 and    (       a1.rangebegin   between  a3.rangebegin and a3.rangeend
                         or     a1.rangeend     between  a3.rangebegin and a3.rangeend))
        from    #a a1
        where   a1.pass = @pass

        select  @pass      = @pass + 1,
                @prevcount = @currcount,
                @currcount = (select count(*) from #a where pass = @pass+1)

--      print '############# %1!', @pass
--      select  * from #a where pass = @pass order by 2,3,4
end

select  category,
        rangebegin,
        rangeend
from    #a
where   pass = @pass
order by 1,2,3

This generates:

 category rangebegin  rangeend
 -------- ----------- -----------
 a                  1          42
 a                 45          70
 a                 81          83
 a                 95          95
 b                  3          44
 c                 30          42

NOTES:

  • the 'final' solution is actually determined twice due to looping until @currcount = @prevcount
  • designed/tested on a SAP(Sybase) ASE 16 instance (ASE does not have support for windows functions)
  • then verified the code also works in SQL Server 2019 (fiddle)
  • it may be possible to rewrite this using a recursive CTE ... maybe? yes? no? shrug (I'm a bit rusty with recursive CTEs)
  • for larger volumes of data OP may want to look at indexing #a
  • once a new set of data is inserted into #a the 'previous' set of data could be deleted as said data will no longer be needed ... OP's choice
markp-fuso
  • 28,790
  • 4
  • 16
  • 36
  • Please explain how this can be written into a view. The `WHILE` statement makes this look to me like a stored procedure or function. – dougp May 16 '22 at 15:38
  • I tried using a recursive CTE. The problem I had is that aggregate functions (`LEAD` and `LAG`) can't be used in a recursive CTE -- at least in SQL Server. – dougp May 16 '22 at 15:41
  • it's not a view ... it's not a stored proc ... it's not a function ... it's a SQL batch with a `while` loop; this particular answer (obviously) does not need to use `lead/lag` – markp-fuso May 16 '22 at 15:53
  • I understand the structure of what you developed. It's meant to be run in a session in a database management tool like SSMS. I do that all the time for initial development. But that's not the environment I'm working in. Having never seen a `WHILE` loop used in a view, I don't understand how what you shared can be rewritten as a view. Not that I need a view, but I need a query that would have the same behavior. If it can't be written as a view, I don't think I can use it in my environment. – dougp May 16 '22 at 22:19
0

Since we're sorting by StartDate and EndDate, we should be able to use the minimum of this and all future start dates and the maximum of this and all past end dates. This is just one more step beyond what Bert Wagner published.

DROP TABLE IF EXISTS #OverlappingDateRanges;
CREATE TABLE #OverlappingDateRanges (StartDate date, EndDate date);

INSERT INTO #OverlappingDateRanges
SELECT '8/24/2017', '9/23/2017'  UNION ALL
SELECT '8/24/2017', '9/20/2017'  UNION ALL 
SELECT '9/23/2017', '9/27/2017'  UNION ALL 
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017', '10/18/2017' UNION ALL 
SELECT '10/25/2017', '11/3/2017'  UNION ALL
SELECT '11/2/2017', '11/4/2017' UNION ALL
SELECT '10/27/2017', '11/15/2017' UNION ALL
SELECT '11/5/2017', '11/10/2017'


;
WITH
Ranges as (
    SELECT MIN(StartDate) OVER (ORDER BY StartDate, EndDate ROWS BETWEEN 0 FOLLOWING AND UNBOUNDED FOLLOWING) as StartDate
    , MAX(EndDate) OVER (ORDER BY StartDate, EndDate ROWS BETWEEN UNBOUNDED PRECEDING AND 0 PRECEDING) as EndDate
    FROM #OverlappingDateRanges
),
Groups as (
    SELECT StartDate
    , EndDate
    , LAG(StartDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousStartDate
    , LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
    FROM Ranges
),
Islands as (
    SELECT StartDate
    , EndDate
    , CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd
    , SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY StartDate, EndDate) AS IslandId
    FROM Groups
)

SELECT MIN(StartDate) AS IslandStartDate
, MAX(EndDate) AS IslandEndDate
FROM Islands
GROUP BY IslandId
ORDER BY IslandStartDate;
IslandStartDate IslandEndDate
--------------- -------------
2017-08-24      2017-10-10
2017-10-17      2017-10-18
2017-10-25      2017-11-15

I'm not finding any start/end combinations that are causing problems. I'll let this marinate for a few days before marking it as correct in case somebody can poke some holes in it.

dougp
  • 2,810
  • 1
  • 8
  • 31