As a Modeler trying to find out what is the best way to handle deletes in SCD Type 2 tables.
As per principle an SCD Type 2 table is to track changes using ETL dates like START_DT and END_DT.
- START_DT will be the date the record is effective from.
- END_DT will be the date it changed to another form or Null/High Date to denote current version of record.
- At all point of time, for a Key Combination there will be a Current Version record with END_DT either Null or High Date.
Now if a record is deleted from Source, what is the best option from below,
- Have additional column like SRC_DELETE_IND which is set 'N' as default and 'Y' if record is deleted from source.
Ex: Record came on 1st Oct
PK_ID START_DT END_DT VALUE SRC_DELETE_IND
1 2021-10-01 Null ABC N
Record had an update on 2nd Oct
PK_ID START_DT END_DT VALUE SRC_DELETE_IND
1 2021-10-01 2021-10-02 ABC N
1 2021-10-02 Null XYZ N
Record got deleted on 3rd Oct
PK_ID START_DT END_DT VALUE SRC_DELETE_IND
1 2021-10-01 2021-10-02 ABC N
1 2021-10-02 Null XYZ Y
- Same as 1 but insert new duplicate row when Delete Came.
Record got deleted on 3rd Oct
PK_ID START_DT END_DT VALUE SRC_DELETE_IND
1 2021-10-01 2021-10-02 ABC N
1 2021-10-02 2021-10-03 XYZ N
1 2021-10-03 Null XYZ Y
- Instead of SRC_DELETE_IND expire/end date the record
Record got deleted on 3rd Oct
PK_ID START_DT END_DT VALUE
1 2021-10-01 2021-10-02 ABC
1 2021-10-02 2021-10-03 XYZ
But here we now dont have a Open record left.
Complexity is added if the record reappears in the Source stating as a Incorrect delete. Lets say on 10th For Option 1, Data will look like,
PK_ID START_DT END_DT VALUE SRC_DELETE_IND
1 2021-10-01 2021-10-02 ABC N
1 2021-10-02 Null XYZ N --Reversed
FOr Option 2
PK_ID START_DT END_DT VALUE SRC_DELETE_IND
1 2021-10-01 2021-10-02 ABC N
1 2021-10-02 2021-10-03 XYZ N
1 2021-10-03 Null XYZ N -- Reversed but now row is duplicate
For Option 3
PK_ID START_DT END_DT VALUE
1 2021-10-01 2021-10-02 ABC
1 2021-10-02 2021-10-03 XYZ
1 2021-10-10 Null XYZ --considered as New since no open record existed. Creates ETL gap
Which option makes more sense and is as per DWH best practices.