4

As a Modeler trying to find out what is the best way to handle deletes in SCD Type 2 tables.

As per principle an SCD Type 2 table is to track changes using ETL dates like START_DT and END_DT.

  1. START_DT will be the date the record is effective from.
  2. END_DT will be the date it changed to another form or Null/High Date to denote current version of record.
  3. At all point of time, for a Key Combination there will be a Current Version record with END_DT either Null or High Date.

Now if a record is deleted from Source, what is the best option from below,

  1. Have additional column like SRC_DELETE_IND which is set 'N' as default and 'Y' if record is deleted from source.
Ex: Record came on 1st Oct
PK_ID START_DT    END_DT     VALUE SRC_DELETE_IND
1     2021-10-01  Null       ABC    N
Record had an update on 2nd Oct
PK_ID START_DT    END_DT     VALUE SRC_DELETE_IND
1     2021-10-01  2021-10-02  ABC    N
1     2021-10-02  Null        XYZ    N
Record got deleted on 3rd Oct
PK_ID START_DT    END_DT     VALUE SRC_DELETE_IND
1     2021-10-01  2021-10-02  ABC    N
1     2021-10-02  Null        XYZ    Y
  1. Same as 1 but insert new duplicate row when Delete Came.
Record got deleted on 3rd Oct
PK_ID START_DT    END_DT     VALUE SRC_DELETE_IND
1     2021-10-01  2021-10-02  ABC    N
1     2021-10-02  2021-10-03  XYZ    N
1     2021-10-03  Null        XYZ    Y
  1. Instead of SRC_DELETE_IND expire/end date the record
Record got deleted on 3rd Oct
PK_ID START_DT    END_DT     VALUE 
1     2021-10-01  2021-10-02  ABC   
1     2021-10-02  2021-10-03  XYZ   

But here we now dont have a Open record left.

Complexity is added if the record reappears in the Source stating as a Incorrect delete. Lets say on 10th For Option 1, Data will look like,

PK_ID START_DT    END_DT     VALUE SRC_DELETE_IND
1     2021-10-01  2021-10-02  ABC    N
1     2021-10-02  Null        XYZ    N --Reversed

FOr Option 2

PK_ID START_DT    END_DT     VALUE SRC_DELETE_IND
1     2021-10-01  2021-10-02  ABC    N
1     2021-10-02  2021-10-03  XYZ    N
1     2021-10-03  Null        XYZ    N  -- Reversed but now row is duplicate

For Option 3

PK_ID START_DT    END_DT     VALUE 
1     2021-10-01  2021-10-02  ABC   
1     2021-10-02  2021-10-03  XYZ   
1     2021-10-10  Null        XYZ  --considered as New since no open record existed. Creates ETL gap

Which option makes more sense and is as per DWH best practices.

Michael Golos
  • 1,814
  • 1
  • 6
  • 16
Ankit Srivastava
  • 185
  • 3
  • 13
  • It depends upon what is the requirement, in SCD2 generally you don’t want to delete anything, best approach is to mark the rows with a DELETE_IND = 'Y'. For any old rows it will be valid because those old dimensions were present in the source, but for new rows you will not be using those dimensions as their DELETE_IND = 'Y'. – Himanshu Kandpal Oct 12 '21 at 17:41
  • We used `DELETE_TIME_UTC` as our delete flag, thus `END_TIME_UTC` was still null, as this phase is still the current one. This has some nice side affect of null (not deleted) takes no storage, and snowflake keeps track of number of null values per column, so you can get free pruning. – Simeon Pilgrim Oct 12 '21 at 19:35
  • 1
    Another option is to not use END Dates at all, then your view will track end-dates based on Window functions and you can add the delete/reversed rule to the view. That way the table itself is never manipulated and remains auditable. – patrick_at_snowflake Oct 12 '21 at 23:10

1 Answers1

1

I would go for a simpler way, put a default END_DATE for the deleted records like 1000-12-31 :

PK_ID START_DT    END_DT     VALUE 
1     2021-10-01  2021-10-02  ABC   
1     2021-10-02  1000-12-31  XYZ --> this row is deleted

Also, avoid using NULL values. A NULL value indicates a lack of a value, which is not the same thing as a value of zero. SQL NULL is a state, not a value. This usage is quite different from most programming languages, where null value of a reference means it is not pointing to any object.

I recommend you to use a default date for END_DT for example 9999-12-31 so when inserting a row, your dimension will be like below :

 PK_ID START_DT    END_DT     VALUE 
 1     2021-10-01  9999-12-31  ABC 

instead of :

PK_ID START_DT    END_DT     VALUE 
1     2021-10-01  NULL       ABC 

I recommend you to add a surrogate key to your dimensions. A dimension table is designed with one column serving as a unique primary key. This primary key cannot be the operational system’s natural key because there will be multiple dimension rows for that natural key when changes are tracked over time. In addition, natural keys for a dimension may be created by more than one source system, and these natural keys may be incompatible or poorly administered. The DW/BI system needs to claim control of the primary keys of all dimensions; rather than using explicit natural keys or natural keys with appended dates, you should create anonymous integer primary keys for every dimension. These dimension surrogate keys are simple integers, assigned in sequence, starting with the value 1, every time a new key is needed. The date dimension is exempt from the surrogate key rule; this highly predictable and stable dimension can use a more meaningful primary key.

Amira Bedhiafi
  • 8,088
  • 6
  • 24
  • 60