Surrogate keys and referential integrity in the EDW

Question

Problem overview
When adopting an Inmon style 3NF Enterprise Data Model, what are some common techniques for handling surrogate keys and referential integrity? In my case I have to populate a 3NF data model that provides an "enterprise view" of several transactional systems. In addition, each OLTP is distributed so that there is one instance per country. Thus, the challenge I'm currently faced with is consolidating each source system into a unified data model.

Actual question
Because each country has its own "local" PK I need a strategy for handling conflicts when consolidating these into the EDW. Is it most common to simply create a composite key in this case? e.g. source_id + source_country or would it better practice to generate a surrogate key here?

For example:

A.foobar
id
description
...

B.foobar
id
description
...

Would become:

EDW.foobar
id
foobar_id
source_country
description

So that in the consolidated data model we end up with a new surrogate key (id) which uniquely identifies each source record (foobar_id + source_country). This seems logical but feels wrong for some reason. And furthermore, hence my question, what impact will this then have on handling referential integrity in the EDW? i.e. if we generate new surrogate keys between source 3NF & EDW 3NF then there's the added complexity of referencing these new keys throughout the EDW schema. In terms of ETL implementation, it would mean having to lookup the newly generated surrogate key via the existing FK (source system) and then replacing it as the new FK. This means maintaining multiple FKs (one to look up the new surrogate key and the new surrogate key itself) in the EDW which seems very far fetched.

If anybody has experience with this problem then I'd appreciate your advice as I don't think my current approach is going to work. There are also several corollary topics e.g. versioning and historization, as well as cdc between EDW 3NF and data marts, that also come into play here but I'll come back to those later.

N.B.
Most of the research I've conducted relates specifically to populating Kimball style data marts, rather than Inmon's 3NF Enterprise Data Model. Furthermore, I've struggled to find anything useful on the topic of consolidating distributed databases whereby the underlying schema is the same.

score 0 · Answer 1 · edited May 23 '17 at 12:03

Generating a surrogate key is the most common way to deal with this scenario. So you'll have your surrogate key (which provides you with key stability and usually better DB performance) but still maintain your business key (because that's what you'll present on the business layer).

what impact will this then have on handling referential integrity in the EDW?

It shouldn't have any. Of course if this is an existing warehouse and you're introducing a surrogate key you're going to have to refactor to propagate the surrogate key throughout the warehouse, but that should be a once-off. Inside the warehouse everything should reference the surrogate key.

Here's an old discussion on the topic of surrogate vs business keys which is well worth the read: Surrogate vs. natural/business keys

score 0 · Answer 2 · answered Oct 15 '15 at 16:40

If you have a perfectly good PK for your countries table and you have another entity that forms a 1-1 relationship with countries, then by all means, use the country PK as the PK of this entity too. It will also serve as a FK reference to the countries table. This forms an identity relationship. That is, the relationship between a country and this other entity is so strong, the identity of the country also forms the identity of this entity.

Don't get into the habit of slapping a surrogate key on every table you create. Even if most tables end up with a surrogate key, the habit of doing so automatically makes for laziness of design and hides those times when a surrogate key is not the best option.

Surrogate keys and referential integrity in the EDW

2 Answers2