Gaps and island fails with 3 columns using SQL Server

Question

I have come across strange behavior with the gaps-and-islands solution. With 3 columns (the 3rd column being non-integer), the result looks random. Suppose we have the following query:

Declare @Table1 TABLE
(
    ID varchar(50), 
    yr float, 
    CO1 varchar(50)
);

INSERT INTO @Table1 (ID, yr, CO1)
VALUES ('I2','2011','ABE'), ('I2','2012','ABE'), ('I2','2013','ABE'),
       ('I2','2014','ABE'), ('I2','2014','ABE'), ('I2','2005','ABD'),
       ('I2','2006','ABD'), ('I2','2007','ABD'), ('I2','2008','ABD'),
       ('I2','2007','ABA CD'), ('I2','2011','ABA CD'), ('I2','2013','ABA CD');

SELECT 
    ID, CO1, StartSeqNo = MIN(yr), EndSeqNo = MAX(yr)
FROM 
    (SELECT 
         ID, yr, CO1,
         rn = yr - ROW_NUMBER() OVER (PARTITION BY ID ORDER BY yr)
     FROM 
         @Table1) a
GROUP BY 
    ID, CO1, rn ;

The result I am aiming for is:

ID  CO1    StartSeqNo   EndSeqNo
----------------------------
I2  ABA CD    2007       2007
I2  ABA CD    2011       2011
I2  ABA CD    2013       2013
I2  ABD       2005       2008
I2  ABE       2011       2014

I have looked through Stack Overflow and elsewhere to determine whether I was missing something. I already tried DISTINCT and DENSE_RANK; neither gives the proper result.

Here are the distinct and dense_rank queries I've already tried:

--- distinct 

SELECT distinct ID,CO1, StartSeqNo=MIN(yr), EndSeqNo=MAX(yr)
FROM (
    SELECT distinct ID, yr, CO1
        ,rn=yr-ROW_NUMBER() OVER (PARTITION BY ID ORDER BY yr)
    FROM @Table1) a
GROUP BY ID, CO1, rn ;

--- with dense_rank
SELECT ID,CO1, StartSeqNo=MIN(yr), EndSeqNo=MAX(yr)
FROM (
    SELECT ID, yr, CO1
        ,rn=yr-dense_rank() OVER (PARTITION BY ID ORDER BY yr)
    FROM @Table1) a
GROUP BY ID, CO1, rn ;

I don't see why the gaps-and-islands query would not work with a non-integer column. I reckon there is an issue with the grouping somewhere. Please help me with this.

Sim

score 1 · Answer 1 · answered Jul 03 '18 at 17:32

You need DENSE_RANK because you have multiple rows with the same ID/yr combination, and you need to add CO1 to PARTITION BY:

SELECT 
    ID, CO1, StartSeqNo = MIN(yr), EndSeqNo = MAX(yr)
FROM 
    (SELECT 
         ID, yr, CO1,
         rn = yr - dense_rank() OVER (PARTITION BY ID, CO1 ORDER BY yr)
     FROM 
         @Table1) a
GROUP BY 
    ID, CO1, rn ;
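
To see why partitioning by CO1 matters, it can help to look at the intermediate values before grouping. The following is a minimal sketch, assuming the sample rows from the question with yr declared as int (an assumption, the question uses float); the column names rnk and grp are only illustrative. Run it as its own batch, since it declares @Table1 again.

```sql
-- Sketch only: sample rows from the question, yr as int (assumption).
DECLARE @Table1 TABLE (ID varchar(50), yr int, CO1 varchar(50));

INSERT INTO @Table1 (ID, yr, CO1)
VALUES ('I2',2011,'ABE'), ('I2',2012,'ABE'), ('I2',2013,'ABE'),
       ('I2',2014,'ABE'), ('I2',2014,'ABE'), ('I2',2005,'ABD'),
       ('I2',2006,'ABD'), ('I2',2007,'ABD'), ('I2',2008,'ABD'),
       ('I2',2007,'ABA CD'), ('I2',2011,'ABA CD'), ('I2',2013,'ABA CD');

-- rnk restarts at 1 for every ID/CO1 pair; grp stays constant
-- for as long as the years in that pair are consecutive.
SELECT ID, CO1, yr,
       rnk = DENSE_RANK() OVER (PARTITION BY ID, CO1 ORDER BY yr),
       grp = yr - DENSE_RANK() OVER (PARTITION BY ID, CO1 ORDER BY yr)
FROM @Table1
ORDER BY ID, CO1, yr;

-- ID  CO1     yr    rnk  grp
-- I2  ABA CD  2007  1    2006
-- I2  ABA CD  2011  2    2009
-- I2  ABA CD  2013  3    2010
-- I2  ABD     2005  1    2004
-- I2  ABD     2006  2    2004
-- I2  ABD     2007  3    2004
-- I2  ABD     2008  4    2004
-- I2  ABE     2011  1    2010
-- I2  ABE     2012  2    2010
-- I2  ABE     2013  3    2010
-- I2  ABE     2014  4    2010
-- I2  ABE     2014  4    2010
```

Each distinct grp value within an ID/CO1 pair is one island. With PARTITION BY ID alone, as in the question, the numbering runs across all CO1 values at once, so the difference no longer stays constant inside a single CO1 island.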

score 0 · Answer 2 · Yogesh Sharma · 2018-07-03T16:48:55.050

You appear to want:

select id, co1, min(yr), max(yr)
from (select *, (case when max(grp) over(partition by co1) > 1 then grp else 1 end) as grp1
      from (select *, yr - lag(yr, 1, yr) over (partition by id, co1 order by yr) as grp
            from @Table1
           ) t
       ) t
group by id, co1, grp1;

edited Jul 03 '18 at 16:48

answered Jul 03 '18 at 16:09

Yogesh Sharma


Unfortunately that won't work, Yogesh. Your proposed solution returns "ABA CD 2007 2013". I require 2007, 2011, 2013 for ABA CD – Simran Jul 03 '18 at 16:16
@Simran. . . Ohh yes, you could use the `lag()` function to check the year gap and use a cumulative approach. – Yogesh Sharma Jul 03 '18 at 16:49
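
One common way to read the `lag()` plus cumulative idea from the comment above is to flag every row that starts a new island and then take a running sum of those flags. This is only a sketch of that general technique, not necessarily the exact query the commenter had in mind. It assumes the sample rows from the question with yr as int, and the names flagged, numbered, is_start and island are made up for illustration. Run it as its own batch.

```sql
-- Sketch only: sample rows from the question, yr as int (assumption).
DECLARE @Table1 TABLE (ID varchar(50), yr int, CO1 varchar(50));

INSERT INTO @Table1 (ID, yr, CO1)
VALUES ('I2',2011,'ABE'), ('I2',2012,'ABE'), ('I2',2013,'ABE'),
       ('I2',2014,'ABE'), ('I2',2014,'ABE'), ('I2',2005,'ABD'),
       ('I2',2006,'ABD'), ('I2',2007,'ABD'), ('I2',2008,'ABD'),
       ('I2',2007,'ABA CD'), ('I2',2011,'ABA CD'), ('I2',2013,'ABA CD');

WITH flagged AS (
    -- is_start = 1 when there is no previous year in the ID/CO1 pair
    -- (LAG returns NULL) or the previous year is more than 1 away.
    SELECT ID, CO1, yr,
           is_start = CASE
                          WHEN yr - LAG(yr) OVER (PARTITION BY ID, CO1 ORDER BY yr) <= 1
                          THEN 0 ELSE 1
                      END
    FROM @Table1
),
numbered AS (
    -- Running sum of the flags. The default RANGE frame treats rows with
    -- the same yr as peers, so duplicate years get the same island number.
    SELECT ID, CO1, yr,
           island = SUM(is_start) OVER (PARTITION BY ID, CO1 ORDER BY yr)
    FROM flagged
)
SELECT ID, CO1, StartSeqNo = MIN(yr), EndSeqNo = MAX(yr)
FROM numbered
GROUP BY ID, CO1, island
ORDER BY ID, CO1, StartSeqNo;

-- ID  CO1     StartSeqNo  EndSeqNo
-- I2  ABA CD  2007        2007
-- I2  ABA CD  2011        2011
-- I2  ABA CD  2013        2013
-- I2  ABD     2005        2008
-- I2  ABE     2011        2014
```

LAG and windowed SUM with ORDER BY need SQL Server 2012 or later.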

score 0 · Accepted Answer · answered Jul 04 '18 at 01:12

With no gaps, the years would be a sequential numbering in each ID/CO1 group that you can compare to a no-gap numbering which of course also must be sequential for each ID/CO1 ordered by year. So, if you don't ORDER BY CO1 (before year), you must also use CO1 to PARTITION BY in the row numbering function. Also, your data contains duplicate rows, so to give equal years in an ID/CO1-group the same number, use the RANK function instead of ROW_NUMBER:

WITH a (ID, CO1, yr, nmbr) AS (
  SELECT ID, CO1, yr
    , yr - RANK() OVER (PARTITION BY ID, CO1 ORDER BY yr)
  FROM @Table1
)
SELECT ID, CO1, StartSeqNo = MIN(yr), EndSeqNo = MAX(yr)
FROM a
GROUP BY ID, CO1, nmbr;

Finally, let me suggest using int instead of float for the year numbers.

`RANK` will not work, must be `DENSE_RANK` instead, i.e. `2014-2014-2015` is sequential, but the result of the rank will be `1-1-3` instead of `1-1-2` — dnoeth, Jul 05 '18 at 14:20
@dnoeth Yes, you are right, the DENSE_RANK function has to be used, I wasn't aware that the RANK function will itself create gaps. With an additional sample record for ABE of 2015, I would of course have noticed that. Thank you! — Wolfgang Kais, Jul 06 '18 at 10:05
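
The point made in these comments can be checked with the extra 2015 row for ABE that the second comment mentions. The sketch below is an illustration under that assumption: the table variable @Demo and the 2015 row are hypothetical and not part of the question's data, and yr is declared as int.

```sql
-- Sketch only: ABE rows from the question plus a hypothetical 2015 row.
DECLARE @Demo TABLE (ID varchar(50), yr int, CO1 varchar(50));

INSERT INTO @Demo (ID, yr, CO1)
VALUES ('I2',2011,'ABE'), ('I2',2012,'ABE'), ('I2',2013,'ABE'),
       ('I2',2014,'ABE'), ('I2',2014,'ABE'), ('I2',2015,'ABE');

-- RANK skips a number after the tie on 2014, DENSE_RANK does not.
SELECT yr,
       grp_rank  = yr - RANK()       OVER (PARTITION BY ID, CO1 ORDER BY yr),
       grp_dense = yr - DENSE_RANK() OVER (PARTITION BY ID, CO1 ORDER BY yr)
FROM @Demo
ORDER BY yr;

-- yr    grp_rank  grp_dense
-- 2011  2010      2010
-- 2012  2010      2010
-- 2013  2010      2010
-- 2014  2010      2010
-- 2014  2010      2010
-- 2015  2009      2010   <- RANK splits 2015 off into its own island

-- The answer's query with DENSE_RANK swapped in for RANK:
WITH a (ID, CO1, yr, nmbr) AS (
  SELECT ID, CO1, yr
    , yr - DENSE_RANK() OVER (PARTITION BY ID, CO1 ORDER BY yr)
  FROM @Demo
)
SELECT ID, CO1, StartSeqNo = MIN(yr), EndSeqNo = MAX(yr)
FROM a
GROUP BY ID, CO1, nmbr;

-- ID  CO1  StartSeqNo  EndSeqNo
-- I2  ABE  2011        2015
```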
