3

In SSIS, if an incoming dataset has multiple records for the same Business Key, how do I load it to the dimensions table with SCD type 2 without using the SCD Wizard.

Sample dataset

Customer ID   Name      Segment     Postal Code
1             James     Corporate   50026
2             Andrew    Consumer    33311
3             Steven    Consumer    90025
2             Andrew    Consumer    33306
3             Steven    Consumer    90032
1             James     Corporate   50087
3             Steven    Consumer    90000

In my case, if I try Loading the dimension table with other SSIS components (Lookup/Conditional Split) all the record show up a new row in the table because they are all coming in all at the same time.

I have ‘CurrentFlag’ as the indicator of the current record.

In SSIS, if I have an incoming dataset that has multiple records for the same Business Key, How do I get to recognize these, and set the CurrentFlag as necessary, whether or not a record in the target table has that Business Key already?

Thanks.

  • Are you using the `ID` column as the timeline of row changes? So as the `ID` value gets bigger, the row is more recent? – iamdave May 10 '19 at 07:32
  • And can you show us your current attempts, specifically using `Lookups`? If you are not using multiple stages as per the examples [here](https://benefic.net/ssis-alternatives-to-the-scd-transform/) then that will be where you are going wrong – iamdave May 10 '19 at 07:35
  • @iamdave Id is an identity column and primary key in the dimension table. Dimension table is as below CREATE TABLE CustomerDetails ( ID int IDENTITY(1,1) PRIMARY KEY, Customer ID varchar(50), Name varchar(50), Segment varchar(50), PostalCode int, Flag bit ); Also incoming dataset do not has ID column.. i will edit the question. – Saurav Kumar May 10 '19 at 10:19
  • @iamdave I am using same as what in the link you shared. The problem is that as i have multiple rows with same Customer ID (Business Key) in the incoming dataset, I am receiving all of them in No Match Output and hence all are inserted as a new record in the dimension table. – Saurav Kumar May 10 '19 at 10:29
  • How do you know which row is the most recent? Slowly Changing Dimensions only work if you know when the row was created or updated. – iamdave May 10 '19 at 14:51
  • CurrentFlag is used to track recent update. If CurrentFlag= Y its latest and all the historical data will have CurrentFlag=N – Saurav Kumar May 14 '19 at 12:50
  • If you have more than one update come through at once though, how do you know what order they happened in? – iamdave May 14 '19 at 15:23

1 Answers1

0

OK, this is a massive simplification because SCD's are very challenging to correctly implement. You will need to sit down and think critically about this. My answer below only handles ongoing daily processing - it does not explain how to handle historical files being re-processed, which could potentially result in duplicate records with different EffectiveStart and End Dates.

By definition, you will have an existing record source component (i.e., query from the database table) and an incoming data source component (i.e., a *.csv flatfile). You will need to perform a merge join to identify new records versus existing records. For existing records, you will need to determine if any of the columns have changed (do this in a Derived Column transformation).

You will need to also include two columns for EffectiveStartDate and EffectiveEndDate.

IncomingEffectiveStartDate = FileDate
IncomingEffectiveEndDate = 12-31-9999
ExistingEffectiveEndDate = FileDate - 1

Note on 12-31-9999: This is effectively the Y10K bug. But, it allows users to query the database between date ranges without having to consciously add ISNULL(GETDATE()) in the WHERE clause of a query in the event that they are querying between date ranges.

This will prevent the dates on the columns from overlapping, which could potentially result in multiple records being returned for a given date.

To determine if a record has changed, create a new column called RecordChangedInd of type Bit.

(ISNULL(ExistingColumn1, 0) != ISNULL(IncomingColumn1, 0) ||
 ISNULL(ExistingColumn2, 0) != ISNULL(IncomingColumn2, 0) ||

....

ISNULL(ExistingColumn_N, 0) != ISNULL(IncomingColumn_N, 0) ? 1 : 0)

Then, in your split condition you can create two outputs: RecordHasChanged (this will be an INSERT) and RecordHasNotChanged (this will be an UPDATE to deactivate the exiting record and an INSERT).

You can conceivably route both inputs to the same INSERT destination. But, you will need to be careful suppress the update record's ExistingEffectiveEndDate value that deactivates the date.

J Weezy
  • 3,507
  • 3
  • 32
  • 88