
I know the concept of SCD-2 and I'm trying to improve my skills with it by doing some practice.

I have the following scenario/experiment:

  1. I call a REST API daily to extract information about companies. In my initial load to the DB everything is new, so everything is very easy.
  2. The next day I call the same REST API, which might return the same companies, but some of them may or may not have changes (e.g., they changed their size, profits, location, ...).

I know SCD-2 might be really simple if the REST API returned only records with changes, but in this case it might also return records without changes.

In this scenario, how do people detect whether a company's data has changed in order to apply SCD-2? Do they compare all the fields?

Is there any example out there that I can see?

Francisco Albert
  • Your question is very general, but normally you have a primary key in your source and add this as a "natural key" in your dimension. Then you compare the values for the same pk with the current version (of the same natural key) in the dimension. You could use some kind of hashing to detect changes if you run into performance issues. – Wouter Jul 28 '21 at 10:14
  • I know it's very general; that reflects that I'm still learning. Sorry about that. But it helped to know that I can use a hash to compare them and also that I might need to compare all the values. Basically that is the answer I was trying to find. Thanks a lot. – Francisco Albert Jul 28 '21 at 10:19

1 Answer


There is no standard SCD-2, nor even a single concept of it. It is a general term for a large number of possible approaches. The only way forward is to practice and see what is suitable for your use case.

In any case you must identify the natural key of the dimension and the set of attributes for which you want to keep history.

You may of course make it more complex by deciding to use your own surrogate key.
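As an illustration of the comparison step (including the hashing idea from the comments), here is a minimal Python sketch; the company attributes and field names are made up for the example:

```python
# Minimal sketch: compare the incoming record with the current dimension
# version by hashing the tracked attributes. Attribute names are hypothetical.
import hashlib

TRACKED_ATTRIBUTES = ["size", "profit", "location"]

def attribute_hash(record: dict) -> str:
    """Deterministic hash over the tracked attributes only."""
    payload = "|".join(str(record.get(col)) for col in TRACKED_ATTRIBUTES)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# current_dim maps natural key -> hash stored with the current (open) version
current_dim = {"C1": attribute_hash({"size": 50, "profit": 1000, "location": "Madrid"})}

incoming = {"company_id": "C1", "size": 60, "profit": 1000, "location": "Madrid"}

if incoming["company_id"] not in current_dim:
    print("new company -> insert first version")
elif current_dim[incoming["company_id"]] != attribute_hash(incoming):
    print("changed -> close the current version and insert a new one")
else:
    print("unchanged -> nothing to do")
```

Comparing the attributes one by one gives the same result; the hash simply avoids storing and comparing every column when the attribute list is long.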

You mentioned that there are two main types of interface for the process:

• You get periodically a full set of the dimension data

• You get the “changes only” (aka delta interface)

Paradoxically, the former is much simpler to handle than the latter.

First of all, in the full dimensional snapshot the natural key is unique, contrary to the delta interface (where you may get multiple changes for one entity).

Additionally, with the delta interface you have to handle late change delivery or even changes delivered in the wrong order.
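A small sketch of one way to pre-process a delta feed so that multiple changes per entity are applied in order; the field names company_id and change_ts are assumptions, not a standard:

```python
# Sort the delta by natural key and change timestamp before applying it,
# so that multiple (possibly out-of-order) changes per entity are handled.
from itertools import groupby
from operator import itemgetter

delta = [
    {"company_id": "C1", "change_ts": "2021-07-29T08:00:00", "size": 70},
    {"company_id": "C1", "change_ts": "2021-07-28T08:00:00", "size": 60},
    {"company_id": "C2", "change_ts": "2021-07-29T09:00:00", "size": 10},
]

delta.sort(key=itemgetter("company_id", "change_ts"))

for company_id, changes in groupby(delta, key=itemgetter("company_id")):
    for change in changes:
        # a real loader would close the previous version here and open a new
        # one valid from change["change_ts"]
        print(company_id, change["change_ts"], change["size"])
```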

The next important decision is whether you expect deletes to occur. This is again trivial in the full interface; in the delta interface you must define some convention for how this information is passed. A connected question is whether a previously deleted entity can be reused (i.e. reappear in the data).

If you support delete/reuse you'll have to think about how to represent them in your dimension table.
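With a full snapshot, deletes can be detected as a set difference between the natural keys currently open in the dimension and the keys delivered today; a minimal sketch:

```python
# Delete detection with a full snapshot: keys missing from today's delivery
# are treated as deleted; keys not yet in the dimension are new (or reused).
current_keys = {"C1", "C2", "C3"}    # open versions in the dimension
snapshot_keys = {"C1", "C3", "C4"}   # natural keys delivered today

deleted = current_keys - snapshot_keys          # close or flag these versions
new_or_reappeared = snapshot_keys - current_keys

print("deleted:", deleted)                      # {'C2'}
print("new or reappeared:", new_or_reappeared)  # {'C4'}
```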

In any case you will need some additional columns in the dimension to cover the historical information.

Some implementations use a change_timestamp, others use a validity interval with valid_from and valid_to.

Yet other implementations claim that an additional sequence number is required, so you avoid the trap of multiple changes with an identical timestamp.
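A minimal sketch of one possible layout, using valid_from/valid_to plus a version sequence number; the column names are illustrative, not a standard:

```python
# Close the current version and insert a new one (the core SCD-2 step),
# with valid_from / valid_to and a version sequence number per natural key.
from datetime import date

HIGH_DATE = date(9999, 12, 31)   # convention for the open (current) version

dimension = [
    {"company_id": "C1", "size": 50, "valid_from": date(2021, 7, 27),
     "valid_to": HIGH_DATE, "version_seq": 1},
]

def apply_change(dim, company_id, new_attributes, load_date):
    """Close the open version (if any) and append the new version."""
    current = [r for r in dim
               if r["company_id"] == company_id and r["valid_to"] == HIGH_DATE]
    next_seq = 1
    if current:
        current[0]["valid_to"] = load_date        # close the open version
        next_seq = current[0]["version_seq"] + 1  # sequence breaks timestamp ties
    dim.append({"company_id": company_id, **new_attributes,
                "valid_from": load_date, "valid_to": HIGH_DATE,
                "version_seq": next_seq})

apply_change(dimension, "C1", {"size": 60}, date(2021, 7, 28))
for row in dimension:
    print(row)
```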

So you see that before you look for a particular implementation, you need to carefully decide on the options above. For example, the full and delta interfaces lead to completely different implementations.

Marmite Bomber