2

I have following table, which apart from other attributes contains:

  1. Customer ID - unique identifier
  2. Value
  3. CreatedDate - when the record has been created (based on ETL)
  4. UpdatedDate - until when the record has been valid

Since there are other attributes apart from the [Value], which are being tracked for historical values, there might be cases, where there are multiple rows with the same [Value] for the same customer, but different timestamps in [CreatedDate] / [UpdatedDate]. Thus, the data may look like:

Customer ID Value CreatedDate UpdatedDate
1 111 04/08/2021 15:00 04/08/2021 17:00
1 111 01/08/2021 09:00 04/08/2021 15:00
1 222 20/07/2021 01:30 01/08/2021 09:00
1 222 01/06/2021 08:00 20/07/2021 01:30
1 111 01/04/2021 07:15 01/06/2021 08:00
2 333 03/08/2021 04:30 04/08/2021 17:00
2 444 23/07/2021 01:20 03/08/2021 04:30
2 444 01/04/2021 13:50 23/07/2021 01:20

I would like to keep the unique [Values] in correct sequence, hence keep the [Value] for the earliest [CreatedDate], however, if Customer had originally Value1, then changed it to Value2 and finally, changed back to Value1. I would like to keep these 2 changes as well. Hence the ideal output should look like:

Customer ID Value CreatedDate UpdatedDate
1 111 01/08/2021 09:00 04/08/2021 17:00
1 222 01/06/2021 08:00 01/08/2021 09:00
1 111 01/04/2021 07:15 01/06/2021 08:00
2 333 03/08/2021 04:30 04/08/2021 17:00
2 444 01/04/2021 13:50 03/08/2021 04:30

Based on CreatedDate / UpdatedDate identify, the chronological sequence of changes and identify the earliest CreatedDate and latest UpdatedDate. However, if particular value appeared multiple times, but has been interspersed by different value, I would like to keep it too.

I've tried the below approach and it works fine however it does not work for the scenario above and the output look like:

SELECT [Customer ID]
        ,Value
        ,MIN(CreatedDate) as CreatedDate
        ,MAX(UpdatedDate) as UpdatedDate
                    
FROM #History
GROUP BY ID, Value
Customer ID Value CreatedDate UpdatedDate
1 111 01/04/2021 07:15 04/08/2021 17:00
1 222 01/06/2021 08:00 01/08/2021 09:00
2 333 03/08/2021 04:30 04/08/2021 17:00
2 444 01/04/2021 13:50 03/08/2021 04:30

Any ideas, please? I've tried using LAG and LEAD as well, but was not able to make it work either.

Thom A
  • 88,727
  • 11
  • 45
  • 75
Srpic
  • 450
  • 5
  • 13
  • 29
  • 1
    Does this answer your question? [Find min and max for subsets of consecutive rows - gaps and islands](https://stackoverflow.com/questions/38151616/find-min-and-max-for-subsets-of-consecutive-rows-gaps-and-islands) – Thom A Aug 04 '21 at 15:23
  • 1
    Does this answer your question? [Group similar objects in different date ranges to get min and max dates in SQL Server](https://stackoverflow.com/questions/49944280/group-similar-objects-in-different-date-ranges-to-get-min-and-max-dates-in-sql-s) – Thom A Aug 04 '21 at 15:24
  • I'm not sure I understand what the UpdatedDate column represents; could you elaborate on that please. Just to be clear as well, the CreatedDate records when that row was inserted into the History table correct? – Jacob FW Aug 04 '21 at 15:29
  • @JacobFW yes, that is correct. The UpdatedDate is usually NULL for the latest record and as soon as the change is identified, the UpdatedDate is CreatedDate from row above. Does it make sense? – Srpic Aug 04 '21 at 15:35
  • @Srpic Gotcha. So the history table records the changes, but do you have another table that has the latest values? So for instance let's say you had a table storing contact information, like firstname, lastname, address, phone numbers, etc. You would have the primary contacts table that always has the latest information, then the history table that records all the changes. Do you have a table like that? – Jacob FW Aug 04 '21 at 15:41
  • My suggestion would be if you do, make sure you have a created timestamp and lastupdated timestamp column in that table. If you're goal is just to know when the primary record was created and last updated, referring to those columns is much more efficient, than always have to do a search of the history table. – Jacob FW Aug 04 '21 at 15:46

1 Answers1

1

This is a type of gaps-and-island problem that is probably best solved by looking for overlaps using a cumulative maximum:

select customerid, min(createddate), max(updateddate)
from (select t.*,
             sum(case when prev_updatedate >= createddate then 0 else 1 end) over (partition by customerid, value order by createddate) as grp
      from (select h.*,
                   max(updateddate) over (partition by customerid, value order by createddate rows between unbounded preceding and 1 preceding) as prev_updatedate
            from #history h
           ) h
     ) h
group by customerid, value, grp;

The logic is to look at the most recent updatedate before each row for each customer and value. If this is earlier than the row's create date, then this starts are new group.

The final result is just aggregating the rows in each group.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786