4

I am having issues retrieving the latest value in a satellite table when some data is changed back to a former value.

The database is Snowflake.

As per Data Vault 2.0, I am currently using the hash diff function to assess whether to insert a new record in a satellite table, like this:

INSERT ALL
  WHEN (SELECT COUNT(*) FROM SAT_ACCOUNT_DETAILS AD WHERE AD.MD5_HUB_ACCOUNT = MD5_Account AND AD.HASH_DIFF = AccHashDiff) = 0
  THEN
    INTO SAT_ACCOUNT_DETAILS (MD5_HUB_ACCOUNT
                            , HASH_DIFF
                            , ACCOUNT_CODE
                            , DESCRIPTION
                            , SOME_DETAIL
                            , LDTS)
    VALUES (MD5_AE_Account
          , AccHashDiff
          , AccountCode
          , Description
          , SomeDetail
          , LoadDTS)
SELECT DISTINCT
    MD5(AccountId) As MD5_Account
  , MD5(UPPER(COALESCE(TO_VARCHAR(AccountCode), '')
    || '^' || COALESCE(TO_VARCHAR(Description), '')
    || '^' || COALESCE(TO_VARCHAR(SomeDetail), '')
    )) AS AccHashDiff
  , AccountCode
  , Description
  , SomeDetail
  , LoadDTS
FROM source_table;

The first time, a new record with AccountCode = '100000' and SomeDetail = 'ABC' is added:

MD5_HUB_ACCOUNT HASH_DIFF ACCOUNT_CODE DESCRIPTION SOME_DETAIL LDTS
c81e72... 8d9d43... 100000 An Account ABC 2021-04-08 10:00

An hour later, an update changes the value of SomeDetail to 'DEF', this is the resulting table:

MD5_HUB_ACCOUNT HASH_DIFF ACCOUNT_CODE DESCRIPTION SOME_DETAIL LDTS
c81e72... 8d9d43... 100000 An Account ABC 2021-04-08 10:00
c81e72... a458b2... 100000 An Account DEF 2021-04-08 11:00

A third update sets the value of SomeDetail back to 'ABC', but the record is not inserted in the satellite table, because the value of the hash diff is the same as the first inserted record (i.e. 8d9d43...).

If I query which is the latest record in the satellite table, the LDTS column tells me it's the one with 'DEF' which is not the desired result. Instead, I should have a record with SomeDetail = 'ABC' and LDTS = '2021-04-08 12:00'.

What is the correct approach to this? If I add LoadDTS to the hash diff, a new record will be created each time an update is pushed, which is not the desired result either.

Roberto B.
  • 41
  • 5
  • for our SCD2 pipeline, we hash the values, but also have a `_is_current` which is true only for the current row, thus we only compare the _current_ row's hashes, when injecting altered data. It does mean you have to alter that row when you do the insert, which I don't recall the pattern of. – Simeon Pilgrim Apr 10 '21 at 00:15
  • _If I query which is the latest record in the satellite table, the LDTS column tells me it's the one with 'DEF'_ But that **is** the latest record? If you compare your incoming record to the latest record it will be different (although with hash there is always a low chance it will not be different) – Nick.Mc Apr 11 '21 at 07:09

1 Answers1

2

As you (and also the standard) mentionned, you need to compare to the last effective record.

I'm not an expert with Snowflake, but it might look like this :

INSERT ALL
  WHEN (SELECT COUNT(*) FROM SAT_ACCOUNT_DETAILS AD WHERE AD.MD5_HUB_ACCOUNT = MD5_Account AND AD.HASH_DIFF = AccHashDiff AND AD.LDTS = (SELECT MAX(LDTS) FROM SAT_ACCOUNT_DETAILS MAD WHERE MAD.MD5_HUB_ACCOUNT = AD.MD5_HUB_ACCOUNT)) = 0
  THEN ....

By adding "AD.LDTS = (SELECT MAX(LDTS) FROM....." to the query, you make sure you test against the latest data and not historical data

MLeblanc
  • 1,816
  • 12
  • 21