Data warehouse modelling issue

Question

I have a typical fact table in the data warehouse - the table has some surrogate keys and measures. In data, the warehouse is also lookup tables - small dimensions inside which history is not maintained. Just surrogate key, business key, and one or two attributes. During fact table load, surrogate keys are taken from lookup tables (joins are based on business keys). So basically on some stage of fact loading, we have business keys inside fact which are used to grab surrogate keys from lookup tables, after that operation, business keys are gone, and later e.g. in data mart (for reporting purposes) we can join lookups with fact using surrogate keys only for some attributes. Until now the process was rather straightforward because we used only one business key to set up attribute value.

But now there are cases where we should use 3 or even more.

E.g. these are conditions:

COLUMN_1 = 'ABC' 
AND COLUMN_2 <> 'Z'
AND COLUMN_3 IN ('1', '2')

COLUMN_1 = 'ABC' 
AND COLUMN_2 <> 'Z'
AND  COLUMN_3 IN ('3', '4', '5')

COLUMN_1 <> 'ABC' OR COLUMN_2 = 'Z'

COLUMN_1, COLUMN_2, COLUMN_3 are business keys that are presented in the fact table. For sure above logic will be applied in the look-up load so let's say that we will have 3 surrogate keys: 1, 2, and 3.

But the main questions is - which approach will be better for the fact table:

To duplicate logic from look up load to fact load? Better performance but this is worse for maintenance (if this needs to be changed in the future, the change will need to be applied in two places), and also surrogate keys will need to be hardcoded.
Put above conditions in join condition? Obviously will be better for maintenance, however much worse for performance (fact table has around 10 000 000 rows inserted daily).
Or maybe there is another solution? Combining the above conditions in the source system also is not an option.

All suggestions are welcome.

score 0 · Answer 1 · answered Oct 19 '20 at 18:39

I have modelled the surrogate keys & fact loads many a times in the past and in my experience (15 years) what works the best as a good balance is the following design:

Surrogate key table design (dim_sgk) Dim_ID BR1_ID1 BR1_ID2 BR1_ID3 BRn_IDx... Record_Start_dttm

Assuming you have a stage table (stg_tbl) where you load your source data / file with business keys src1.col1, src1.col2 & src1.col3

Now, when you load a surrogate key table from a stage table such that you

Select *
from 
stg_tbl left outer join dim_sgk 
on stg_tbl.src1.col1 = dim_sgk.br1_id1
and stg_tbl.src1.col2 = dim_sgk.br1_id2
and stg_tbl.src1.col3 = dim_sgk.br1_id3
where dim sgk.dim_id is null

and generate (or auto based on rdbms technology and pros/cons for it) surrogate keys for all unique combinations of the 3 business keys.

Once, surrogate key table has been refreshed; you can begin loading your fact by joining your source transaction table with surrogate key table & picking up the surrogate keys along the way (left outer join)

I would already keep the business keys in the fact table as a non pk attribute only for reporting purposes. There is no need to join surrogate keys with facts to only pick business keys later on. You use surrogate keys to optimize joins and distribution of data on disks. But while keeping business keys as non identifying attributes in fact you get best of both worlds.

There are multiple principles of surrogate keys you should keep in mind (error handling, orphan handling etc). Depending upon your rdbms technology it might make sense to partition your table based on a certain index which helps you retrieve errors / -1s and reprocess them into fact table without any performance hit. If you need to know more on this technique feel free to reach out to me. Happy to help.

There is a very detailed guidline I put together for another question on SGKs and you can use it for reference Managing surrogate keys in a data warehouse

Kind regards, Babar

Thank you Babar for this clarification. Your approach looks totally fine, however we are not obliged to change the logic in warehouse which is in production for couple of years. What's more I think you are talking about main surrogate keys which are defining unique record in given table, I used term surrogate but this could be confusing because this is id not for row in fact table but in id for row in mentioned lookup table. Anyway we need to stick to described logic which cannot be change. — mila, Oct 20 '20 at 08:18

Data warehouse modelling issue

1 Answers1