Surrogate key vs. composite key in data warehouse

Question

In Data Warehouse(DW) we have dimensions and facts table. Dimensions keys migrate as foreign keys into facts table, and thus in facts table forms composite primary key in that tables. Of course, sometimes is not needed all foreign keys to create unique primary key, because in most situations uniqueness is defined by few foreign keys.

But, I was wondering why fact tables doesn't have surrogate key as primary key, like dimension tables? First, when indexing primary key in DW (nonclustered index - some kind of best practices) is it better to have one column in index, or five? I know that DW systems doesn't care much about disk amount that indexes reserves, but every time for me is logical to have surrogate key instead composite foreign key.

Can someone explain why this is not standard practice?

score 0 · Answer 1 · answered May 14 '14 at 09:31

0

This is because it is rare to reference the fact table by anything other than the dimensions.

answered May 14 '14 at 09:31

Ronnis

12,593
2
32
52

score 0 · Answer 2 · answered May 17 '14 at 04:40

Data Warehouse is as effective as its ability to load the data and access it later. As explained by Ronnis; we seldom need the capability to reference the combination of the composite keys in a fact table to do any operation; While loading, if we have a primary key then the loading takes that much longer which is fine if we need to access the record by the primary key; else it is a waste of time.

Consider the following fact

CustomerId
Date
PartId
NumberofOrders
OrderQuantity
InvoiceAmount

The keys are CustomerId, Date, PartId

Having an additional record for the combination of the keys will not impact any analysis that is done on this fact table, so it might be redundant to have a primary key.

Consider the following fact

CostCenterId
DivisionId
Month
DepartmentId
OpeningBalance
Credits
Debits

There are two types of designs that are possible; one is when you want only a single record for this combination; in which case you will define a primary key

An alternate design is also possible; create the opening balance record at the start of every month When you process the transactions, populate the opening balance as zero in which case, you can have multiple records

The bottom line is ensure you have a manageable number of records to summarize, pay more importance to ensure your loading is faster.

Hope this explains the trend in design of not using a primary key

score 0 · Answer 3 · edited Oct 22 '20 at 17:24

Depends upon your most used use cases for reporting / application sitting on top. In most data warehouse you end up using Surrogate Keys from dim tables as FK. In a fact, you may have duplicate (partial row) and for completely valid reasons. Thus, a PK is often not defined unless you really need it.

Depending upon your RDBMS technology and its capabilities on indexing, PK / Primary Index side you will define an optimal PDM (physical data model) that best serves your data warehouse. For instance, in Teradata, I would define a Primary Index (not PK) in a fact table that serves my purpose and may not have complete composite key in there. In an Event fact table or Network Activity fact table where we record each event as a transaction an EVENT ID will suffice as Primary Index to ensure optimal distribution of data for better performance. Separate index can be defined to optimize reporting if most of reporting use cases queried this table based on customer id or product id or site id or whatever ID that's often used. Then you can define similar index on such an ID where ever it is present in other Facts or dimensions giving you an optimal access path.

This heavily depends on your use case, rdbms capabilities & access path design to cleverly define a PDM.

In case of Surrogate Keys, please refer to my post as a guideline here Managing surrogate keys in a data warehouse

Surrogate key vs. composite key in data warehouse

3 Answers3