Find start and end dates when one field changes

Question

I have this data in a table

FIELD_A   FIELD_B     FIELD_D
249052903   10/15/2011 N
249052903   11/15/2011 P ------------- VALUE CHANGED
249052903   12/15/2011 P
249052903   1/15/2012   N ------------- VALUE CHANGED
249052903   2/15/2012   N
249052903   3/15/2012   N
249052903   4/15/2012   N
249052903   5/15/2012   N
249052903   6/15/2012   N
249052903   7/15/2012   N
249052903   8/15/2012   N
249052903   9/15/2012   N

When ever the value in FIELD_D changes it forms a group and I need the min and max dates in that group. The query shoud return

FIELD_A   GROUP_START   GROUP_END
249052903   10/15/2011  10/15/2011
249052903   11/15/2011  12/15/2011
249052903   1/15/2012              9/15/2012

The examples that I have seen so far have the data in Field_D being unique. Here the data can repeat as shown, First it is "N" then it changes to "P" and then back to "N".

Any help will be appreciated

Thanks

score 3 · Accepted Answer · answered Mar 27 '13 at 05:05

3

You can use analytic functions - LAG, LEAD, and COUNT() OVER to your advantage, if they are supported by your SQL implementation. SQL Fiddle here.

WITH EndsMarked AS (
  SELECT
    FIELD_A,
    FIELD_B,
    CASE WHEN FIELD_D = LAG(FIELD_D,1) OVER (ORDER BY FIELD_B)
         THEN 0 ELSE 1 END AS IS_START,
    CASE WHEN FIELD_D = LEAD(FIELD_D,1) OVER (ORDER BY FIELD_B)
         THEN 0 ELSE 1 END AS IS_END
  FROM T
), GroupsNumbered AS (
  SELECT
    FIELD_A,
    FIELD_B,
    IS_START,
    IS_END,
    COUNT(CASE WHEN IS_START = 1 THEN 1 END)
      OVER (ORDER BY FIELD_B) AS GroupNum
  FROM EndsMarked
  WHERE IS_START=1 OR IS_END=1
)
  SELECT
    FIELD_A,
    MIN(FIELD_B) AS GROUP_START,
    MAX(FIELD_B) AS GROUP_END
    FROM GroupsNumbered
    GROUP BY FIELD_A, GroupNum;

answered Mar 27 '13 at 05:05

Steve Kass

7,144
20
26

Thanks for the answer, however this will not yield the correct answer as only one row is returned. The EndsMarked creates only one group – cooperjv Mar 27 '13 at 20:24
Did you run the SQL Fiddle? The query returns three rows - exactly the three rows you said "the query should return." – Steve Kass Mar 27 '13 at 21:44
Thanks for the followup. It was my mistake. The query returned the correct data. Thanks for all the help. – cooperjv Mar 27 '13 at 22:35

Gordon Linoff · Answer 2 · 2013-03-27T13:03:12.220

1

This is fairly easy to express in SQL using subqueries:

select Field_A, Field_D, min(Field_B) as Group_Start, max(Field_B) as Group_End
from (select t.*,
             (select min(field_B)
              from t t2
              where t2.field_A = t.field_A and
                    t2.field_B > t.field_B and
                    t2.Field_D <> t.field_D
             ) as TheGroup
      from t
     ) t
group by Field_A, Field_D, TheGroup

This is assigning a group identifier using a correlated subquery. The identifier is the first value of Field_B where Field_D changes.

You don't mention the database you are using, so this uses standard SQL.

edited Mar 27 '13 at 13:03

answered Mar 27 '13 at 01:31

Gordon Linoff

1,242,037
58
646
786

this query creates only two groups. TheGroup needs the additional condition t2.Field_B > t.Field_B. I still think a stored procedure would be a faster and easier to maintain solution. – koriander Mar 27 '13 at 11:34
@koriander . . . I thnk you for pointing out that I left out the condition on `Field_B`. I do disagree with the rest of your comment, though. – Gordon Linoff Mar 27 '13 at 13:05
I actually like the elegance of your SQL query. But if performance is a problem I would definitely look into a basic sequential file processing. The original data looks like coming from a log, naturally sorted, no need for indexing. – koriander Mar 27 '13 at 13:28
How can the `t2` subquery filter based on fields from `t`? I thought subqueries couldn't access fields outside themselves. – Stevoisiak Dec 01 '21 at 17:05

koriander · Answer 3 · 2013-03-27T10:58:36.033

0

Don't use SQL for this problem because it is not possible to do it in SQL with a single table scan since it requires comparison between records. It would need a full table scan plus at least a join with itself. It is trivial to implement a solution in a imperative language and it only requires a single table scan. Edit: a stored procedure would be best.

edited Mar 27 '13 at 10:58

answered Mar 27 '13 at 00:03

koriander

3,110
2
15
23

1

This statement is utterly false. Tables in SQL are inherently unordered. In order to retrieve the adta in the right order, you would need to use an `order by` clause, which requires multiple reads/writes of the data. – Gordon Linoff Mar 27 '13 at 13:04
According to the theoretical relational model, relations are unordered. However, in practice, tables can be stored in order, using a clustered index, see http://stackoverflow.com/questions/1251636/what-do-clustered-and-non-clustered-index-actually-mean. Furthermore, SQL to solve this problem would use several indexes and the order_by needs one index. And the table itself only needs to be read once, per index, not multiple times as you claim. As a matter of fact, having this table stored as single sequential file, not in a DBMS, for this problem alone, looks to be the best solution to me. – koriander Mar 27 '13 at 13:24

C. Sorensen · Answer 4 · 2020-11-09T11:06:40.717

I modified the answers a bit where you have multiple Field_A's. This should always work :-)

WITH EndsMarked 
AS 
(
    SELECT 

         [Field_A]
        ,[Field_B]
        ,CASE 
            WHEN LAG([Field_D],1) OVER (PARTITION BY [Field_A] ORDER BY [Field_A],[Field_B]) IS NULL
             AND ROW_NUMBER() OVER (PARTITION BY [Field_A] ORDER BY [Field_B]) = 1 
            THEN 1
            WHEN LAG([Field_D],1) OVER (PARTITION BY [Field_A] ORDER BY [Field_A],[Field_B]) > 0
              <> LAG([Field_D],0) OVER (PARTITION BY [Field_A] ORDER BY [Field_A],[Field_B]) > 0
            THEN 1 
            ELSE 0 
        END AS IS_START
       ,CASE 
            WHEN LEAD([Field_D],1) OVER (PARTITION BY [Field_A] ORDER BY [Field_A],[Field_B]) IS NULL
             AND ROW_NUMBER() OVER (PARTITION BY [Field_A] ORDER BY [Field_B] DESC) = 1 
            THEN 1
            WHEN LEAD([Field_D],0) OVER (PARTITION BY [Field_A] ORDER BY [Field_A],[Field_B]) 
              <> LEAD([Field_D],1) OVER (PARTITION BY [Field_A] ORDER BY [Field_A],[Field_B]) 
            THEN 1          
            ELSE 0 
        END                 AS IS_END

    FROM 
    (
        SELECT

            [Field_A]
           ,[Field_B]
           ,[Field_D]
           ,[Aantal Facturen]

        FROM [T]

    )   F
    
)
,GroupsNumbered 
AS 
(
  SELECT
     [Field_A]
    ,[Field_B]
    ,IS_START
    ,IS_END
    ,COUNT(CASE
               WHEN IS_START = 1 
               THEN 1 
           END)                     OVER (ORDER BY [Field_A]
                                                  ,[Field_B]) AS GroupNum
  FROM      EndsMarked
  WHERE     IS_START        = 1 
     OR     IS_END          = 1
)

    SELECT

        [Field_A]
        ,MIN([Field_B]) AS GROUP_START
        ,MAX([Field_B]) AS GROUP_END

    FROM GroupsNumbered

    GROUP BY [Field_A], GroupNum

Find start and end dates when one field changes

4 Answers4

Linked