I have segmented linear data to identify the minimum and maximum begin and end values of each range, now I want to consolidate overlapping ranges. This is easy if the data is easy. For example, if the data is always increasing for both the begin and end values. (see the first 4 rows in my sample input below) But where the data is not as uniform, I'm having trouble computing the begin and end of a group of overlapping rows by relating data sets.
It's possible I'm headed down a rabbit hole because of an early decision about how to structure my inputs. The source data doesn't contain the rangebegin and rangeend columns I'm using for inputs. This was my choice to try to help me analyze the data. There may be other options.
Since the volume of information below may be an XY problem, here are the basics of what I'm trying to solve.
In my case, I'm trying to identify one-mile segments of roadway where there have been at least 5 crashes, then combine overlapping ranges to draw as a single geometric feature on a map. So...
Given this data:
Crash A at mile 2.22
Crash B at mile 2.24
Crash C at mile 3.10
Crash D at mile 3.92
Crash E at mile 5.03
Crash F at mile 5.21
I would have a segment from 2.22 to 3.92 because there is less than a mile between each crash in that range. Then there's a gap of more than a mile, then the last two crashes are within a mile of each other. So the output would look like
begin | end |
---|---|
2.22 | 3.92 |
5.03 | 5.21 |
Because of some other variables, the way I am combining the data thus far has produced some ranges that are not sequential. Some of the ranges are included in other ranges. That seems to be causing my biggest problems.
I'm using SQL Server (mostly 2016), but the concept should hold for any database system.
--- Update ---
I am rewriting the question to refer to a "known good" solution as a starting point.
SO moderators: If rewriting this question after receiving answers is wrong, let's fix it.
tl;dr: I'm developing a query that will likely ultimately be written as a series of common table expressions to serve as a data set for a reporting system. I am using a reporting system, not a database management tool like SSMS, so I can't run a "batch". I also don't believe a temporary stored procedure is appropriate, nor do I want to create objects (stored procedures, functions,...) on the database server to solve this single reporting problem. I do realize that the nature of the problem may dictate that I do these things, but I am searching for a query-only solution first.
On the surface, this problem appears to be a classic gaps-and-islands problem, for which I have been using a similar solution for years. But this one involves not only overlapping ranges, but also ranges that don't sort well and so don't work with the classic solution. Starting with https://bertwagner.com/posts/gaps-and-islands/, as recommended by Josh presents a problem. Other overlapping gaps-and-islands solutions I have seen appear to be the same, and so have the same problem. Replacing the last line of inputs in Bert Wagner's solution with...
SELECT '11/2/2017', '11/4/2017' UNION ALL
SELECT '10/27/2017', '11/15/2017' UNION ALL
SELECT '11/5/2017', '11/10/2017'
...helps demonstrate my problem.
Bert's solution is written as a deeply-nested set of subqueries. To improve readability, I would do this as a series of common table expressions.
In an attempt to demonstrate the problem and keep each step in my attempt simple and reproducible, I have broken this down to a series of temporary tables, each with its own outputs. These temporary tables equate to the common table expressions.
DROP TABLE IF EXISTS #OverlappingDateRanges;
CREATE TABLE #OverlappingDateRanges (StartDate date, EndDate date);
INSERT INTO #OverlappingDateRanges
SELECT '8/24/2017', '9/23/2017' UNION ALL
SELECT '8/24/2017', '9/20/2017' UNION ALL
SELECT '9/23/2017', '9/27/2017' UNION ALL
SELECT '9/25/2017', '10/10/2017' UNION ALL
SELECT '10/17/2017','10/18/2017' UNION ALL
SELECT '10/25/2017','11/3/2017' UNION ALL
SELECT '11/2/2017', '11/4/2017' UNION ALL
SELECT '10/27/2017', '11/15/2017' UNION ALL
SELECT '11/5/2017', '11/10/2017'
SELECT * FROM #OverlappingDateRanges;
StartDate EndDate
---------- ----------
2017-08-24 2017-09-23
2017-08-24 2017-09-20
2017-09-23 2017-09-27
2017-09-25 2017-10-10
2017-10-17 2017-10-18
2017-10-25 2017-11-03
2017-11-02 2017-11-04
2017-10-27 2017-11-15
2017-11-05 2017-11-10
DROP TABLE IF EXISTS #Groups;
SELECT ROW_NUMBER() OVER(ORDER BY StartDate,EndDate) AS RN
, StartDate
, EndDate
, LAG(EndDate,1) OVER (ORDER BY StartDate, EndDate) AS PreviousEndDate
INTO #Groups
FROM #OverlappingDateRanges;
SELECT *
FROM #Groups;
RN StartDate EndDate PreviousEndDate
------ ---------- ---------- ---------------
1 2017-08-24 2017-09-20 NULL
2 2017-08-24 2017-09-23 2017-09-20
3 2017-09-23 2017-09-27 2017-09-23
4 2017-09-25 2017-10-10 2017-09-27
5 2017-10-17 2017-10-18 2017-10-10
6 2017-10-25 2017-11-03 2017-10-18
7 2017-10-27 2017-11-15 2017-11-03
8 2017-11-02 2017-11-04 2017-11-15
9 2017-11-05 2017-11-10 2017-11-04 <-- Notice PreviousEndDate < StartDate
DROP TABLE IF EXISTS #Islands;
SELECT *
, CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END AS IslandStartInd
, SUM(CASE WHEN PreviousEndDate >= StartDate THEN 0 ELSE 1 END) OVER (ORDER BY RN) AS IslandId
INTO #Islands
FROM #Groups g;
SELECT *
FROM #Islands;
RN StartDate EndDate PreviousEndDate IslandStartInd IslandId
------ ---------- ---------- --------------- -------------- -----------
1 2017-08-24 2017-09-20 NULL 1 1
2 2017-08-24 2017-09-23 2017-09-20 0 1
3 2017-09-23 2017-09-27 2017-09-23 0 1
4 2017-09-25 2017-10-10 2017-09-27 0 1
5 2017-10-17 2017-10-18 2017-10-10 1 2
6 2017-10-25 2017-11-03 2017-10-18 1 3
7 2017-10-27 2017-11-15 2017-11-03 0 3
8 2017-11-02 2017-11-04 2017-11-15 0 3
9 2017-11-05 2017-11-10 2017-11-04 1 4 <-- This is within IslandId = 3
SELECT MIN(StartDate) AS IslandStartDate
, MAX(EndDate) AS IslandEndDate
FROM #Islands
GROUP BY IslandId
ORDER BY IslandStartDate;
IslandStartDate IslandEndDate
--------------- -------------
2017-08-24 2017-10-10
2017-10-17 2017-10-18
2017-10-25 2017-11-15
2017-11-05 2017-11-10 <-- This is not a distinct island.
As you can see, the 4th island should be included in the third.
And reviewing Bert's example again in detail, and rewriting it (I'm a tactile learner) has given me some other thoughts regarding how I might solve this. I am posting this here first to be fair to others who may have spent time reading and working on this and care about SO points.