
I'm using SQL Server 2008 R2 for this problem. Here's an example dataset:

WIRE_ID     FROM    TO      CLASS
05485       0.000   1.520   PL
05485       1.520   3.050   PL
05485       3.050   22.250  SL
05485       3.050   22.250  SP
05485       22.250  33.530  SL
05485       22.250  33.530  QT
05485       33.530  43.580  QT
05485       43.580  52.580  PL
05485       52.580  57.910  QT
114161      0.000   3.000   SW
114161      3.000   5.000   SL
114161      5.000   6.000   SL
114161      6.000   9.412   YN
114161      9.412   10.549  YN
114161      10.549  12.375  CM
114161      12.375  14.438  SL
114161      14.438  15.126  SL

So: a non-sequential ID associated with ranged values and a group/classification. As you can see, you can sometimes have duplicate intervals, as different classes may be applied to the same range. Ultimately, the result I'd like to achieve would look like the following:

WIRE_ID     FROM    TO      CLASS
05485       0.000   3.050   PL
05485       3.050   22.250  SL
05485       3.050   22.250  SP
05485       22.250  33.530  SL
05485       22.250  43.580  QT
05485       43.580  52.580  PL
05485       52.580  57.910  QT
114161      0.000   3.000   SW
114161      3.000   6.000   SL
114161      6.000   10.549  YN
114161      10.549  12.375  CM
114161      12.375  15.126  SL
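
In plain terms, the merge I'm after extends an interval while its TO equals the next FROM, within each WIRE_ID/CLASS pair. A minimal Python sketch of that logic follows (helper name `merge_touching` is mine, purely illustrative). Note this sketch also combines the two touching SL rows for wire 05485 (3.050-22.250 and 22.250-33.530), which the output above keeps separate; as noted in the comments below, keeping them separate was my mistake.

```python
from itertools import groupby

def merge_touching(rows):
    """rows: (wire_id, frm, to, cls) tuples. Within each (wire_id, cls),
    merge consecutive rows where one row's TO equals the next row's FROM."""
    rows = sorted(rows, key=lambda r: (r[0], r[3], r[1]))  # wire, class, from
    merged = []
    for (wire, cls), grp in groupby(rows, key=lambda r: (r[0], r[3])):
        grp = list(grp)
        start, end = grp[0][1], grp[0][2]
        for _, frm, to, _ in grp[1:]:
            if frm == end:                      # touching: extend the island
                end = to
            else:                               # gap: close this island
                merged.append((wire, start, end, cls))
                start, end = frm, to
        merged.append((wire, start, end, cls))
    return sorted(merged)
```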

Seems easy at first, and I've constructed a solution that works, but once I apply it to the entire dataset it grinds to a halt. Ideally I need a solution that can handle a million rows of this style of data in a more or less efficient manner. Here's my solution:

Declare @WIRE_CLASS Table(WIRE_ID varchar(25), [FROM] float, [TO] float, CLASS varchar(15));
Insert @WIRE_CLASS(WIRE_ID, [FROM], [TO], CLASS) Values
('05485',0.000,1.520,'PL'),
('05485',1.520,3.050,'PL'),
('05485',3.050,22.250,'SL'),
('05485',3.050,22.250,'SP'),
('05485',22.250,33.530,'SL'),
('05485',22.250,33.530,'QT'),
('05485',33.530,43.580,'QT'),
('05485',43.580,52.580,'PL'),
('05485',52.580,57.910,'QT'),
('114161',0.000,3.000,'SW'),
('114161',3.000,5.000,'SL'),
('114161',5.000,6.000,'SL'),
('114161',6.000,9.412,'YN'),
('114161',9.412,10.549,'YN'),
('114161',10.549,12.375,'CM'),
('114161',12.375,14.438,'SL'),
('114161',14.438,15.126,'SL');

;WITH WIRE AS (
SELECT
    WIRE_ID, 
    [FROM],
    [TO],
    CLASS
FROM 
    @WIRE_CLASS
), ISLANDS AS (
SELECT
    ROW_NUMBER() OVER (ORDER BY WI.WIRE_ID, WI.[FROM]) ID,
    WI.WIRE_ID, 
    WI.[FROM],
    WI.[TO],
    WI.CLASS,
    CASE WHEN WI2.WIRE_ID IS NULL THEN 1 ELSE 0 END BREAKER
FROM 
    WIRE WI 
    LEFT JOIN WIRE WI2 ON
        WI2.WIRE_ID = WI.WIRE_ID
        AND (WI2.[TO] = WI.[FROM]) 
        AND WI2.CLASS = WI.CLASS
), DATA AS (
SELECT 
    IS1.WIRE_ID, IS1.[FROM], IS1.[TO], IS1.CLASS,
    (SELECT SUM(BREAKER) FROM ISLANDS IS2 WHERE IS1.ID >= IS2.ID) BREAKER
FROM ISLANDS IS1
)
SELECT 
    DA.WIRE_ID,
    MIN(DA.[FROM]),
    MAX(DA.[TO]),
    MIN(DA.CLASS)
FROM DATA DA
GROUP BY 
    DA.WIRE_ID, 
    BREAKER,
    DA.CLASS
ORDER BY 
    DA.WIRE_ID,
    MIN(DA.[FROM]),
    MAX(DA.[TO])
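
In plain terms, the query flags each row that doesn't continue a previous row (BREAKER) and then numbers the islands with a running sum of those flags. That labeling step can be done in a single linear pass; here is a rough Python sketch of the same idea (my helper names, not part of the query):

```python
def label_islands(rows):
    """rows: (wire_id, frm, to, cls). Scan in (wire, class, from) order;
    BREAKER is 1 when a row does not continue the previous row, and a
    running sum of BREAKER values numbers the islands."""
    island = 0
    prev = None
    labeled = []
    for wire, frm, to, cls in sorted(rows, key=lambda r: (r[0], r[3], r[1])):
        breaker = not (prev is not None and prev[0] == wire
                       and prev[3] == cls and prev[2] == frm)
        island += breaker                      # running sum of break flags
        labeled.append((island, wire, frm, to, cls))
        prev = (wire, frm, to, cls)
    return labeled
```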

Can you suggest a better way to do this? Thanks a bunch, SQL gurus!

Harry
  • Harry, where did you get this problem from? Is this from some tutorial or a textbook? How many rows do you have, and how much time does it take for this query to come to a halt? Btw, what is the logic for assigning a class to a combination of wire_id, to and from? What is the purpose of this table? Per your logic, `05485 43.580 52.580 PL` should be combined with `05485 0.000 3.050 PL` and `05485 1.520 3.050 PL` – Erran Morad Sep 27 '14 at 19:38
  • The problem is from some data I've inherited, I'm set with the structure so that can't change. For 918 records you're talking about 9 seconds to run. For 1878 it's 30 seconds. For 3988 it's 2min42secs. For 50132 things grind to a halt and after 30 minutes I kill the query. – Harry Oct 02 '14 at 01:06
  • Sorry, for 50k records I kill the query after an hour. – Harry Oct 02 '14 at 01:33
  • Can you publish your large data set somewhere so people can try their solutions on the scale that you want? – Code Different Oct 02 '14 at 15:59
  • Can you please explain why `(05485,3.050,22.250,SL)` and `(05485,22.250,33.530,SL)` are not the same island in your desired output. I fail to see the logic behind this – Lorentz Vedeler Oct 02 '14 at 17:13
  • Can an island contain more then 2 rows in the full data set ? In your example, the maximum seems to be 2 rows per "grouping" – r.m Oct 02 '14 at 17:14
  • The first CTE, WIRE, serves no useful purpose. – Mike Sherrill 'Cat Recall' Oct 04 '14 at 02:35
  • Just to elaborate on @LorentzVedeler's comment, the problem is there's no criterion specifying the order of rows, and so, when there are duplicate ranges, there is no way to tell which `CLASS` should go first. Rows 3 to 5 in your sample go as `SL`, `SP`, `SL` and the first two of them are the same range. If they went `SP`, `SL`, `SL`, the two `SL` rows would have condensed to a single row in the output, same as with rows 6 and 7 (`QT`) – those are not split by an `SL` row, but the `SL` rows *are* split by an `SP` row. Why? What criterion determines that? – Andriy M Oct 05 '14 at 07:15
  • Are your from and to always only significant to three digits? – Kyle Hale Oct 05 '14 at 13:16
  • A lot of questions, thanks for all the input on this problem! I will try to address them now. Zoff Dino - Unfortunately I can't publish the large data set, as its IP is sensitive. Lorentz Vedeler - The intervals (05485,3.050,22.250,SL) and (05485,22.250,33.530,SL) aren't combined mostly due to the fact that "duplicate" or overlapping intervals can exist (i.e. 05485, 3.050, 22.250, SP). This was a mistake on my behalf, as this interval should be combined if possible. Islands can and do contain more than 2 rows of data. – Harry Oct 08 '14 at 04:25
  • Andriy M - The criterion that controls the order of the rows is simply arbitrary user input. There's no way to tell, or in fact to order (unless you go alphanumeric on class), in our case. Kyle Hale - Yes, 3 digits is the max precision for those two fields. – Harry Oct 08 '14 at 04:29

6 Answers

4

As others have said, using a temp table instead of a table variable will make a world of difference to the performance. I assume, however, that you've only used the table variable to pose your example, and in reality you have a proper user table that contains all this data.

My query will find the islands, but produces slightly different output from your sample results. It will not split islands the way yours appears to in some instances: I see a discrepancy in the way the SL/SP/QT results are handled for wire 05485. I think

WIRE_ID     FROM    TO      CLASS
05485       3.050   22.250  SL
05485       3.050   22.250  SP
05485       22.250  33.530  SL
05485       22.250  33.530  QT
05485       33.530  43.580  QT

should result in

WIRE_ID     FROM    TO      CLASS
05485       3.050   33.530  SL
05485       3.050   22.250  SP
05485       22.250  43.580  QT

not

WIRE_ID     FROM    TO      CLASS
05485       3.050   22.250  SL
05485       3.050   22.250  SP
05485       22.250  33.530  SL
05485       22.250  43.580  QT

Your results have split the SL island, but not the QT, even though there are matching from/to pairs between SL & SP (3.050/22.250) and between SL & QT (22.250/33.530).

I've run the query below on a VM on my laptop (read: a real server should be quicker) against 1M records, with ~2000 wire/class combinations and ~20% of rows needing their from/to combined. I've randomly generated the data a few times, and it typically takes between 80 and 100 seconds.

I've created the table as:

IF OBJECT_ID('tempdb.dbo.#WIRE_CLASS', 'U') IS NOT NULL
    DROP TABLE #WIRE_CLASS
GO
CREATE TABLE #WIRE_CLASS (
    WIRE_ID varchar(25),
    [FROM] float,
    [TO] float,
    CLASS varchar(15),
    PRIMARY KEY (WIRE_ID, [FROM], CLASS)
)

Here's the query, with explanations in the comments:

-- The CROSS JOIN ensures we always have a pair of first and last from/to pairs
-- The LEFT JOIN matches all from=to combinations,
-- allowing the WHERE clause to restrict to just the first and last
-- These first/last pairs are then grouped in the CTE
-- The final SELECT is then quite simple
;WITH GroupedData AS (
    SELECT
        (ROW_NUMBER() OVER (ORDER BY W1.WIRE_ID, W1.CLASS, W1.[FROM]) - 1) / 2 Grp,
        W1.WIRE_ID,
        W1.[FROM],
        W1.[TO],
        W1.CLASS
    FROM
        #WIRE_CLASS W1
        CROSS JOIN (SELECT 0 AS [First] UNION SELECT 1) SetOrder
        LEFT OUTER JOIN #WIRE_CLASS W2 ON
            W1.WIRE_ID = W2.WIRE_ID
            AND W1.CLASS = W2.CLASS
            AND ((W1.[TO] = W2.[FROM] AND [First] = 0)
              OR (W2.[TO] = W1.[FROM] AND [First] = 1))
    WHERE
        W2.WIRE_ID IS NULL
)
SELECT
    WIRE_ID,
    MIN([FROM]) AS [FROM],
    MAX([TO]) AS [TO],
    CLASS
FROM GroupedData
GROUP BY Grp, WIRE_ID, CLASS
ORDER BY WIRE_ID, [FROM], CLASS
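
The pairing trick can be sketched in Python (helper name `chain_endpoints` is mine, not from the answer): find each chain's first row (nothing ends where it starts) and last row (nothing starts where it ends), sort the endpoint rows, and pair them off two at a time, just as the `(ROW_NUMBER() - 1) / 2` grouping does.

```python
def chain_endpoints(rows):
    """rows: (wire_id, frm, to, cls). For each chain of touching rows
    within a (wire_id, cls), emit one merged (wire_id, frm, to, cls)."""
    froms = {(w, c, f) for w, f, t, c in rows}
    tos = {(w, c, t) for w, f, t, c in rows}
    # first rows of chains: no row in the same wire/class ends at their FROM
    firsts = [(w, c, f, t) for w, f, t, c in rows if (w, c, f) not in tos]
    # last rows of chains: no row in the same wire/class starts at their TO
    lasts = [(w, c, f, t) for w, f, t, c in rows if (w, c, t) not in froms]
    ends = sorted(firsts + lasts)              # order: wire, class, from
    pairs = []
    for k in range(0, len(ends), 2):           # endpoint k//2 pairs them up
        (w, c, f1, _), (_, _, _, t2) = ends[k], ends[k + 1]
        pairs.append((w, f1, t2, c))
    return pairs
```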

Scott C
  • Hi Scott! This is looking good so far! Just tested it on a 4000 subset in a table with >4mill records. Took about 3m:45s - still slow but on the plus side didn't cause my database to start deadlocking like my solution. Will give it some more testing and report back my findings. – Harry Oct 08 '14 at 04:43
  • Actually I had made a mistake in one of my linkages applying your query back to my dataset. After correcting it the same 4000 subset took mere milliseconds to run. A larger dataset of 400K+ records took 10 seconds. Doing some final QA over the results but this looks like the winner for me. ++For seeing my initial error in not combining certain islands with overlaps. :) – Harry Oct 08 '14 at 05:43
  • All good from me. Gone above the call of duty to present a much more effective and efficient solution (and correct!) than my own. Very humbly obliged. – Harry Oct 08 '14 at 05:54
  • so, we have a winner. Congrats – Horaciux Oct 08 '14 at 15:12
  • Thanks for the bounty on this Horaciux, pretty sure it would have gone unanswered without it! :) – Harry Oct 14 '14 at 01:28
1

Using temporary tables instead of CTEs with large datasets will improve the performance, but without your dataset I'm unable to test this.

Here are your queries using temp tables:

Declare @WIRE_CLASS Table(WIRE_ID varchar(25), 
                          [FROM] float, 
                          [TO] float, 
                          CLASS varchar(15));

Insert @WIRE_CLASS(WIRE_ID, [FROM], [TO], CLASS) 
Values ('05485',0.000,1.520,'PL'),
       ('05485',1.520,3.050,'PL'),
       ('05485',3.050,22.250,'SL'),
       ('05485',3.050,22.250,'SP'),
       ('05485',22.250,33.530,'SL'),
       ('05485',22.250,33.530,'QT'),
       ('05485',33.530,43.580,'QT'),
       ('05485',43.580,52.580,'PL'),
       ('05485',52.580,57.910,'QT'),
       ('114161',0.000,3.000,'SW'),
       ('114161',3.000,5.000,'SL'),
       ('114161',5.000,6.000,'SL'),
       ('114161',6.000,9.412,'YN'),
       ('114161',9.412,10.549,'YN'),
       ('114161',10.549,12.375,'CM'),
       ('114161',12.375,14.438,'SL'),
       ('114161',14.438,15.126,'SL');

SELECT
    WIRE_ID, 
    [FROM],
    [TO],
    CLASS
INTO #tmp_WIRE
FROM 
    @WIRE_CLASS

SELECT
    ROW_NUMBER() OVER (ORDER BY WI.WIRE_ID, WI.[FROM]) ID,
    WI.WIRE_ID, 
    WI.[FROM],
    WI.[TO],
    WI.CLASS,
    CASE WHEN WI2.WIRE_ID IS NULL THEN 1 ELSE 0 END  as BREAKER
INTO #tmp_ISLANDS
FROM 
    #tmp_WIRE WI 
    LEFT JOIN #tmp_WIRE WI2 ON
        WI2.WIRE_ID = WI.WIRE_ID
        AND (WI2.[TO] = WI.[FROM]) 
        AND WI2.CLASS = WI.CLASS

SELECT 
    IS1.WIRE_ID, IS1.[FROM], IS1.[TO], IS1.CLASS,
    (SELECT sum(BREAKER) FROM #tmp_ISLANDS IS2 WHERE IS1.ID >=  IS2.ID) BREAKER
INTO #tmp_DATA
FROM #tmp_ISLANDS IS1

SELECT 
    DA.WIRE_ID,
    MIN(DA.[FROM]),
    MAX(DA.[TO]),
    MIN(DA.CLASS)
FROM #tmp_DATA DA
GROUP BY 
    DA.WIRE_ID, 
    BREAKER,
    DA.CLASS
ORDER BY 
    DA.WIRE_ID,
    MIN(DA.[FROM]),
    MAX(DA.[TO])

Related Reading:

Which are more performant, CTE or temporary tables?

What's the difference between a CTE and a Temp Table?

Tanner
  • Harry, could you provide the schema used for the table(s) being used so Tanner can test the upper bounds of the amount of data that would be able to be stored with a given number of rows? – Drew Oct 05 '14 at 11:54
  • The temporary tables were only to demonstrate the dataset, I'm actually querying real tables. :) – Harry Oct 08 '14 at 04:31
0

Note: this query is written for Oracle; you will have to convert it.

The bottleneck in your case is the multiple self-joins of such large temporary tables. A different approach can do this without joining. First, let's try to group overlapping rows. For this I used a sequence and two functions; the functionality of the two functions is the same as nextval and currval. The query is:

  SELECT WIRE_ID,
     FROM_,
     TO_,
     CLASS,
     CASE
        WHEN LAG (TO_, 1, -999999)
                OVER (PARTITION BY WIRE_ID, CLASS ORDER BY FROM_) >= FROM_
        THEN
           CURRENTVALUE ()
        ELSE
           NEXTVALUE ()
     END
        GRP
FROM WIRE_CLASS
ORDER BY WIRE_ID, CLASS, FROM_

The output is:

(output image unavailable)

Now just use this output as inline view and aggregate. Query is

  SELECT WIRE_ID,
     MIN (FROM_) AS FROM_,
     MAX (TO_) AS TO_,
     CLASS
FROM (SELECT WIRE_ID,
             FROM_,
             TO_,
             CLASS,
             CASE
                WHEN LAG (TO_, 1, -999999)
                        OVER (PARTITION BY WIRE_ID, CLASS ORDER BY FROM_) >=
                        FROM_
                THEN
                   CURRENTVALUE ()
                ELSE
                   NEXTVALUE ()
             END
                AS GRP
        FROM WIRE_CLASS) BASE_Q
GROUP BY WIRE_ID, CLASS, GRP
ORDER BY WIRE_ID, FROM_;

The final output is:

(output image unavailable)

The functions are used because Oracle does not allow sequences to be used in conditional statements. Here are the definitions of the functions:

CREATE OR REPLACE FUNCTION NEXTVALUE
RETURN NUMBER
IS
VAL   NUMBER;
BEGIN
SELECT WIRE.NEXTVAL INTO VAL FROM DUAL;
RETURN VAL;
END;

CREATE OR REPLACE FUNCTION CURRENTVALUE
RETURN NUMBER
IS
VAL   NUMBER;
BEGIN
SELECT WIRE.CURRVAL INTO VAL FROM DUAL;
RETURN VAL;
END;
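
The grouping logic can be sketched in Python without a sequence; a plain counter plays the role of NEXTVALUE/CURRENTVALUE (a sketch under the same ordering assumptions; helper name `lag_groups` is mine). Because the condition is `prev_to >= frm`, overlapping rows merge as well as touching ones.

```python
def lag_groups(rows):
    """rows: (wire_id, frm, to, cls). Walk rows ordered by (wire, class,
    from); start a new group when the previous TO in the partition is
    less than the current FROM (a real gap), then aggregate MIN/MAX."""
    grp = 0
    labeled = []
    prev = None                                # (wire, cls, to) of last row
    for wire, frm, to, cls in sorted(rows, key=lambda r: (r[0], r[3], r[1])):
        same_part = prev is not None and prev[0] == wire and prev[1] == cls
        if not (same_part and prev[2] >= frm):
            grp += 1                           # NEXTVALUE(): new group
        labeled.append((grp, wire, frm, to, cls))
        prev = (wire, cls, to)
    # aggregate MIN(FROM) / MAX(TO) per (group, wire, class)
    agg = {}
    for g, wire, frm, to, cls in labeled:
        lo, hi = agg.get((g, wire, cls), (frm, to))
        agg[(g, wire, cls)] = (min(lo, frm), max(hi, to))
    return sorted((w, lo, hi, c) for (g, w, c), (lo, hi) in agg.items())
```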

Hope it helps.

  • An Oracle solution to a SQL Server problem? – JotaBe Oct 05 '14 at 16:24
  • The core of the solution is redesigning the query. As I am not aware of the SQL Server specific syntax, I wrote it in Oracle. I'm not using any Oracle-specific functions here. – Neerav Kumar Oct 05 '14 at 23:29
  • Thanks for your input but `DUAL`, `CURRVAL`, `LAG` are definitely Oracle specific. Unfortunately SQL 2008 definitely doesn't have `LAG` (it's very annoying) – Nick.Mc Oct 05 '14 at 23:34
0

You would want to be sure that all keys are indexed. Additionally, if you're matching columns of different types, that's going to yield a performance hit.

Also, where you have the ROW_NUMBER function, that pushes the whole thing to at least O(2n), which is partly why you're seeing such long run times with your algorithm.

And, as Tanner wrote, a move to temp tables would yield a performance improvement.

Drew
0

You can use the OVER() clause with the MIN and MAX functions.

Below you will find a working sample (including the generation of a decent test data set).

-- create a temp table to hold a decent amount of testdata
IF OBJECT_ID('tempdb..#testdata') IS NOT NULL 
    DROP TABLE #testdata;

CREATE TABLE #testdata
    (
      Id INT PRIMARY KEY ,
      WIRE_ID INT NOT NULL ,
      _FROM FLOAT NOT NULL ,
      _TO FLOAT NOT NULL ,
      CLASS VARCHAR(15) NOT NULL
    );

-- create some testdata
-- : 1M records, divided over 10K WIRE_IDs (so 100 values per wire-id)
-- : FROM-TO values randomly divided into 4 value ranges [0-25][26-50][51-75][76-100]
-- : CLASS = bucket number + random classification from 1 to 5 (cls_1_2 => cls_[bucketnum]_[rand between 1 and 5])
DECLARE @nrows INT = 1000000;
DECLARE @nwires INT = 100;
DECLARE @numbuckets INT = 4;
DECLARE @bucketsize INT= 25000;
DECLARE @bucketstartMax INT= 5000;
DECLARE @clsMax INT = 5;

WITH    TestRecs
          AS ( SELECT TOP ( @nrows )
                        ROW_NUMBER() OVER ( ORDER BY ( SELECT 1) ) AS Rnum ,
                        bucketNumber = ABS(CHECKSUM(NEWID()) % @numbuckets) ,
                        fromStart = ABS(CHECKSUM(NEWID()) % @bucketstartMax) ,
                        fromIncrement = ABS(CHECKSUM(NEWID()) % ( @bucketsize- ABS(CHECKSUM(NEWID())% @bucketstartMax) )) ,
                        clsNum = ABS(CHECKSUM(NEWID()) % @clsMax) + 1
               FROM     sys.all_columns c1
                        CROSS JOIN sys.all_columns c2
                        CROSS JOIN sys.all_columns c3
             )
    INSERT  INTO #testdata
            ( Id ,
              WIRE_ID ,
              [_FROM] ,
              [_TO] ,
              CLASS 
            )
            SELECT  RNum AS _id ,
                    CAST(RNum / @nwires + 1 AS INT) AS Wire_ID ,
                    _from = ( ( bucketNumber * @bucketsize ) + fromStart ) / CAST(1000 AS FLOAT) ,
                    _to = ( ( bucketNumber * @bucketsize ) + fromStart + fromIncrement ) / CAST(1000 AS FLOAT) ,
                    _cls = 'cls' + CAST(bucketNumber AS VARCHAR(3)) + '_' + CAST(clsNum AS VARCHAR(5))
            FROM    TestRecs;

-- selecting the results can be achieved using OVER() clauses on MIN() and MAX()
-- note this forces you to include _from and _to in the GROUP BY,
-- so a 2nd select is required to get the distinct list
WITH    Bucketized
          AS ( SELECT   WIRE_ID ,
                        CLASS ,
                        MIN(_FROM) OVER ( PARTITION BY WIRE_ID, CLASS ) AS _FROM ,
                        MAX(_TO) OVER ( PARTITION BY WIRE_ID, CLASS ) AS _TO
               FROM     #testdata
               GROUP BY WIRE_ID ,
                        CLASS ,
                        _FROM ,
                        _TO
             )
    SELECT  WIRE_ID ,
            CLASS ,
            _FROM ,
            _TO
    FROM    Bucketized
    GROUP BY WIRE_ID ,
            CLASS ,
            _FROM ,
            _TO;

-- a sample with a million rows takes 6 seconds on my laptop
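
For clarity, the MIN/MAX OVER (PARTITION BY WIRE_ID, CLASS) step reduces to one interval per wire/class pair; here is a Python sketch of that reduction (helper name mine). Worth noting: it ignores gaps, so a class that reappears later on the same wire (like PL on wire 05485 at 0.000-3.050 and again at 43.580-52.580 in the question's sample) would be collapsed into one row, which differs from the island semantics asked for.

```python
def minmax_per_class(rows):
    """rows: (wire_id, frm, to, cls). Collapse each (wire_id, cls) to a
    single (min FROM, max TO) interval, regardless of any gaps between."""
    agg = {}
    for wire, frm, to, cls in rows:
        lo, hi = agg.get((wire, cls), (frm, to))
        agg[(wire, cls)] = (min(lo, frm), max(hi, to))
    return sorted((w, lo, hi, c) for (w, c), (lo, hi) in agg.items())
```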
souplex
0

Try this:

select wire_id, class, [from], [to], 
ROW_NUMBER() over (order by wire_id, class,[from], [to]) gg,
1 as is_continous
into #temp1
from #WIRE_CLASS
group by wire_id, class, [from], [to]

--Including index to improve performance of selects
create index IX_temp1 on #temp1(gg) include (is_continous)
create index IX_temp2 on #temp1(wire_id, class) include (is_continous)

update this
set is_continous = 0 
from #temp1 prev join #temp1 this on prev.gg = this.gg -1
where prev.[to] <> this.[from]



update this
set this.[from] = prev.[from] 
--select prev.*, this.* 
from #temp1 prev join #temp1 this 
on prev.gg <= this.gg and this.gg <> 1 and prev.wire_id = this.WIRE_ID 
and prev.CLASS = this.CLASS and this.is_continous = 1

update prev
set prev.[to] = this.[to]
--select prev.*, this.* 
from #temp1 prev join #temp1 this 
on prev.gg <= this.gg and this.gg <> 1 and prev.wire_id = this.WIRE_ID 
and prev.CLASS = this.CLASS and this.is_continous = 1 and this.gg <> 1

select distinct wire_id, [from], [to], class
from #temp1
SouravA