I have implemented CDC (SCD Type 2) on a customer data set.
CDC is applied to a large number of columns, but the requirement is to track changes for only a subset of those columns.
In the input table below, ID identifies the customer, RATE_CODE is one of the CDC fields, and START_DATE and END_DATE are the CDC change dates.
I want to see how the customer data (RATE_CODE) changes over a period of time.
For example, rows 1-3 have the same RATE_CODE, so I need MIN(START_DATE) from row 1 and MAX(END_DATE) from row 3.
I tried grouping by (ID, RATE_CODE) with MIN and MAX on the dates, but that gives the wrong result: the MAX end date gets picked from row 9, whereas I want rows 6-9 to form a separate entry.
INPUT TABLE
ROW NUMBER | ID | RATE_CODE | START_DATE | END_DATE |
---|---|---|---|---|
1 | 1 | A1 | 01-01-2021 | 18-01-2021 |
2 | 1 | A1 | 18-01-2021 | 25-02-2021 |
3 | 1 | A1 | 25-02-2021 | 15-03-2021 |
4 | 1 | A2 | 15-03-2021 | 28-03-2021 |
5 | 1 | A2 | 28-03-2021 | 28-05-2021 |
6 | 1 | A1 | 28-05-2021 | 28-06-2021 |
7 | 1 | A1 | 28-06-2021 | 12-07-2021 |
8 | 1 | A1 | 20-07-2021 | 28-07-2021 |
9 | 1 | A1 | 28-08-2021 | 13-09-2021 |
10 | 1 | A2 | 13-09-2021 | 13-10-2021 |
EXPECTED OUTPUT
ID | RATE_CODE | START_DATE | END_DATE |
---|---|---|---|
1 | A1 | 01-01-2021 | 15-03-2021 |
1 | A2 | 15-03-2021 | 28-05-2021 |
1 | A1 | 28-05-2021 | 13-09-2021 |
1 | A2 | 13-09-2021 | 13-10-2021 |
There are probably articles or answers about this on the net already, but because of how I was framing the question I couldn't find them.
I would prefer a solution in SQL, but for the benefit of the community, PySpark and other languages are also welcome.
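For reference, here is a minimal, runnable repro of my failed GROUP BY attempt (using SQLite via Python only for portability; the table name `customer_cdc` and ISO date storage are just choices for the repro). It shows how the two separate A1 stretches (rows 1-3 and rows 6-9) get merged into a single group:

```python
import sqlite3

# In-memory SQLite database to reproduce the problem.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_cdc (
        row_number INTEGER,
        id         INTEGER,
        rate_code  TEXT,
        start_date TEXT,  -- stored as ISO yyyy-mm-dd so MIN/MAX sort correctly
        end_date   TEXT
    )
""")
rows = [
    (1,  1, "A1", "2021-01-01", "2021-01-18"),
    (2,  1, "A1", "2021-01-18", "2021-02-25"),
    (3,  1, "A1", "2021-02-25", "2021-03-15"),
    (4,  1, "A2", "2021-03-15", "2021-03-28"),
    (5,  1, "A2", "2021-03-28", "2021-05-28"),
    (6,  1, "A1", "2021-05-28", "2021-06-28"),
    (7,  1, "A1", "2021-06-28", "2021-07-12"),
    (8,  1, "A1", "2021-07-20", "2021-07-28"),
    (9,  1, "A1", "2021-08-28", "2021-09-13"),
    (10, 1, "A2", "2021-09-13", "2021-10-13"),
]
conn.executemany("INSERT INTO customer_cdc VALUES (?,?,?,?,?)", rows)

# A plain GROUP BY collapses ALL rows sharing a rate code, so rows 1-3
# and rows 6-9 (both A1) fold into one group instead of two.
wrong = conn.execute("""
    SELECT id, rate_code, MIN(start_date), MAX(end_date)
    FROM customer_cdc
    GROUP BY id, rate_code
    ORDER BY MIN(start_date)
""").fetchall()
for row in wrong:
    print(row)
# Returns only 2 rows instead of the 4 I expect, and the first A1 row
# ends at 2021-09-13 (taken from row 9) instead of 2021-03-15.
```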