creating dimension surrogate keys

Question

I understand there are good reasons for using surrogate keys in data warehouse dimensions. Still, I do not understand how I can link them to my fact table's foreign keys. In the fact table I have only natural keys, extracted during ETL. Surrogate keys are not present in the original database tables. Any suggestions on this? Thank you

can you post a ERD of your facts/dims to help explain? – fez Dec 14 '15 at 13:49 — fez, Dec 14 '15 at 13:49

score 5 · Answer 1 · edited Dec 14 '15 at 17:42

There are several "see" references below. This is my first answer on Stack Overflow, so I don't have enough reputation points to provide the links to you yet. If you lookup those terms on Wikipedia, they provide a more eloquent description of those things than I could.

In data warehouses I have worked with we typically store the surrogate keys that reference the various dimensions in the fact table. In fact, I avoid storing the natural keys from the source systems in the fact table except in special circumstances (e.g., degenerate dimensions). There are a couple of reasons for this:

Join efficiency - some source systems might not use a simple integer key; using a surrogate key allows you to reduce this complexity so that your data warehouse query performance is better since it only has to deal with a single column integer join.
- Abstraction of your fact table from the source system; your fact table might outlive a particular source system (or version of a source system) or the facts may be coming from different source systems with different natural keys. Regardless of natural key design, the relationship between the fact table and the dimension table remains the same.
- Accurate and efficient point-in-time facts - If history of attributes in your dimensions are important, you can use the surrogate key to allow you to store historical copies of your dimension rows and attach the correction version to your fact table row (see Slowly Changing Dimension; especially Type 2)
- The dimension may be used by multiple fact tables from multiple source systems or my be consolidated from multiple source systems in which case there will not be a simple relationship between source system natural key and the dimension surrogate key (see Conformed Dimensions)
- Unknowns - Sometimes you can have facts where the natural key is NULL, an invalid value, or some weirdness. You can represent that condition and still maintain referential integrity in the database by having one of more special rows in your dimension table to represent Unknown, Invalid, Hasn't Happened Yet, or whatever is appropriate. (Technically, a value of NULL can't be a key, but some database engines will let you get away with it at the cost of performance and usability of your data warehouse)
- I'm sure I'm forgetting some really important one...

Generally, during the transform phase of loading the fact table, I lookup the surrogate key for the natural key coming from the source system and then store the surrogate key in the fact table instead of the natural key. I don't know what platform you're on, you could use JOINs on most any database platform to do this. I use SSIS lookups frequently on the Microsoft SQL Server platform.

The last paragraph answers the question is and the key concept is the **lookup** - a transformation that you use to replace the natural keys with the surrogate ones. It may be as simple as a JOIN between the fact table and dimension table on the natural key. — Marek Grzenkowicz, Dec 14 '15 at 17:45

score 2 · Answer 2 · answered Dec 14 '15 at 17:22

Surrogate key assignment is implemented in the ETL process of loading the fact tables.

The natural key e.g. product code ABSFG-QXYX-12673726 is mapped using a dedicated mapping table to a surrogate key typically integer, say 1238.

A surrogate key is useful and should be deployed in following scenarios:

The natural key is changeable, i.e. you need to report with same surrogate key despite the change of natural key
The natural key can be reused, i.e. you get the same natural key, but it must be reported as a new entity
The natural key is unhandy, e.g. extreme long hash code which could be problematic for storage, joins or sorting

There are some use cases where the usage of surrogate key should be critically questioned:

The natural key (e.g. the key extracted from the source system) is already surrogate (e.g. sequence generated key)
The source system cannot provide additional information about the natural key (e.g. the change or reuse information); i.e. the surrogate key will effectively be a 1:1 mapping of the natural key.
NEVER use surrogate key for a time dimension especially if the fact table is range partitioned on time (as the surrogate key would disabled the range partition pruning).

A surrogate key also has its drawbacks

It requires the mapping in the ETL process (performance)
The final report may require back-mapping to the natural key (performance)
Consistency – you must handle the case that the mapping to the surrogate key fails if an “unknown” natural key appears. Should you reject the fact record or is it the problem of the (incomplete) mapping table?
Of course ETL errors in mapping of surrogate keys can lead to problems…

So to draw a summary, I’d say yes, there are good reasons for using surrogate keys, but there are also good reasons for NOT using surrogate keys. You should always carefully examine the interface from the source system and for each dimension key check if the profit of using surrogate key is higher the costs.

The question was not about pros and cons of surrogate keys (as stated in my first sentence), but how can I turn the natural foreign keys in my fact table into surrogate keys? I already know how to create surrogate keys in the dimensions. Thanks — Davide, Dec 15 '15 at 16:40
@Davide - you are completely right. This is of course much harder to answer. I’ll try to add/reformulate my answer. — Marmite Bomber, Dec 15 '15 at 17:05

score 2 · Answer 3 · answered Dec 14 '15 at 22:13

To answer strictly to the question "how I can link them to my fact table's foreign keys" you should load your dimensions first by assigning surrogate key, then load your facts by looking up to dimensions by business key and locate surrogate key and use it to store the facts. Alternatively, in case of late arriving dimensions, if you can't find a dimension member by business key on fact loading, then create dimension member using business key and use assigned surrogate key. Later, when dimension member will arrive to the DWH just update additional attributes in it.

From practical point of view, if there's no need for near-real-time DWH, or if you update your DWH just twice per day, e.g nightly and at lunch time, and it takes just couple of minutes to reload it from scratch, and if your main fact tables have only few millions of records then don't bother with surrogate keys. In practice, this is good if you have really HUGE workload and many millions of facts. You can get some insight by reading this article https://technet.microsoft.com/en-us/magazine/2008.04.dwperformance.aspx

If you want to utilize really big DWHs, e.g. massive parallel DWH or Hadoop technologies, then you should change your surrogate keys to hash keys to minimize data movement, avoid data skew and to provide balanced execution.

Really appreciate your answer to the actual question! Besides, how can one handle the scenario in which dimensions do not have a business key naturally? — basquiatraphaeu, Feb 27 '22 at 20:02

creating dimension surrogate keys

3 Answers3