Identify ranges in groups which are identified by a flag

Question

I have the following table:

declare  @table table (dates int , is_missing tinyint, group_id numeric(18))
insert into @table(dates,is_missing,group_id)
select 20110719,0,1
union all
select 20110720,0,1
union all
select 20110721,0,1
union all
select 20110722,1,1
union all
select 20110723,0,1
union all
select 20110724,0,1
union all
select 20110725,0,1
union all
select 20110726,1,1
union all
select 20110727,0,1
union all
select 20110728,1,1
union all
select 20110723,1,3
union all
select 20110724,0,3
union all
select 20110725,0,3
union all
select 20110726,1,3
union all
select 20110727,0,3


select * from @table
order by group_id, dates

What I am trying to do is to return ranges of dates for each group which are identified by the missing day flag. To make this more clear the results of the query will have to look like this:

group_id    start_date   end_date      days_count
1            20110719       20110721      3
1            20110723       20110725      3
1            20110727       20110727      1
3            20110724       20110725      2
3            20110727       20110727      1

The is_missing flag basicaly separates the ranges per group. It actually says that a date is missing and therefore all the other dates located between is_missing flags are the groups I am trying to find their start and end dates as well as their days numbers count.

Is there a simple way to do this?

Thanks a lot.

score 1 · Answer 1 · edited May 23 '17 at 12:19

Here is a possible solution using Common Table Expression (CTE) and ROW_NUMBER(). This type of problem is known as islands. Using the concept that was used in this Stack Overflow question: sql group by only rows which are in sequence, the following query was formulated to produce desired output against the data provided by you.

~~This query works correctly if the data stored in the table is ordered by group_id and dates columns. I assume that is the case with your data. If not, you might need to tweak the solution.~~

Modified the query as per suggestions provided by Andriy M. Thanks to Andriy M.

The query has been changed so that it can provide correct output even if the date values in the table are not in sequence. The question has the date values stored in int data type instead of date format. So, two queries have been provided below. First query will work if the table contains date values stored in int data typeand the second query will work if the table contains date values stored in datetime or date data type.

This query will work only in SQL Server versions 2005 and above. Since you have tagged your question under sql-server-2008, I think this should work for you.

Screenshot #1 displays the data stored in the table. Screenshot #2 displays the output of the below mentioned queries against the table data.

Hope that helps.

Query for date values stored in int data type: .

WITH cte AS
(       
    SELECT  datenumeric
        ,   is_missing
        ,   group_id
        ,   datenumeric
                - DENSE_RANK() OVER (PARTITION BY is_missing ORDER BY group_id, datenumeric) AS partition_grp 
    FROM    dbo.table_data
)
SELECT      cte.group_id
        ,   MIN(cte.datenumeric)      AS start_date
        ,   MAX(cte.datenumeric)      AS end_date
        ,   COUNT(cte.datenumeric)    AS days_count
FROM        cte
WHERE       cte.is_missing = 0
GROUP BY    cte.group_id
        ,   cte.partition_grp
ORDER BY    cte.group_id
        ,   cte.partition_grp;

Query for date values stored in datetime or date data type: .

WITH cte AS
(       
    SELECT  datevalue
        ,   is_missing
        ,   group_id
        ,   DATEDIFF(DAY, 0, datevalue)
                - DENSE_RANK() OVER (PARTITION BY is_missing ORDER BY group_id, datevalue) AS partition_grp 
    FROM    dbo.table_data
)
SELECT      cte.group_id
        ,   MIN(cte.datevalue)      AS start_date
        ,   MAX(cte.datevalue)      AS end_date
        ,   COUNT(cte.datevalue)    AS days_count
FROM        cte
WHERE       cte.is_missing = 0
GROUP BY    cte.group_id
        ,   cte.partition_grp
ORDER BY    cte.group_id
        ,   cte.partition_grp;

Screenshot #1:

Screenshot #2:

Basically, your solution does follow the pattern in the linked answer. However, it would be safer to rank the rows by `date` instead of by `id`, in case one day some historical data were added and so older dates would use newer IDs. And if you chose ranking by `date`, I would also recommend you to replace the global `ROW_NUMBER` (i.e. the first one) with something to represent the date as an integer (like `DATEDIFF(day, 0, date)`), because the more ranking functions you use, the more expensive your query is. — Andriy M, May 30 '11 at 08:12
On a different point, there's a really great temptation to use `datenumeric` for calculating group IDs like you did. Why, it's already numeric and doesn't seem to need any special processing before applying it to the formula. But your method would actually break on month transitions. For example, `20110530 - 1` and `20110531 - 2` produce the same result and the corresponding dates would justly be grouped together. But the very next date gets to a new group (which is incorrect), because the result of `20110601 - 3` is different. … — Andriy M, May 30 '11 at 12:58
So `datenumeric` should either be cast twice (to `varchar`, then to `datetime`) and used with DATEDIFF, or... ranked, like you did originally. (I'm just not certain which would be better, it needs some testing to be sure.) — Andriy M, May 30 '11 at 12:58
As I said in my deleted comment, it's a good idea to take measures about working around possible duplicate dates. But I think you forgot to account for that in `COUNT()`. If you had a standalone, but duplicated date, your query would count it more than once, and you would have rows like `1, 2011-07-27, 2011-07-27, 2`. That is easily fixed by either using `COUNT(DISTINCT cte.datevalue)` instead of `COUNT(cte.datevalue)` or by changing the expression to `MAX(cte.datevalue) - MIN(cte.datevalue) + 1`. Still, it would be nice if the OP clarified whether there could be duplicates in the first place. — Andriy M, May 30 '11 at 13:20

score 0 · Answer 2 · answered May 30 '11 at 18:03

With many thanks to Siva for the nice solution, I thought if there was one date missing in the data, the query would fail.

so I modified the query a little and used ROW_NUMBER() to fix that.

WITH cte AS
(       
    SELECT  dates 
        ,   is_missing
        ,   group_id
        ,ROW_NUMBER() OVER (ORDER BY group_id, dates) -
           DENSE_RANK() OVER (PARTITION BY is_missing ORDER BY group_id, dates) AS partition_Id 
    FROM    dbo.table_data
)
SELECT      group_id
        ,   MIN(dates) AS start_date
        ,   MAX(dates) AS end_date
        ,   COUNT(*) AS days_count
FROM        cte
WHERE       is_missing = 0
GROUP BY    group_id
        ,   partition_id
ORDER BY    group_id
        ,   partition_id;

Or maybe a missing date will never happen. :)

Identify ranges in groups which are identified by a flag

2 Answers2