0

I am designing a datawarehouse which is fed by a lot of source system. My current ETL process has a step where the primary keys of source systems (usually multiple columns) are concatenated, hashed (via MD2) and then saved in a uniqueidentifier column.

This uniqueidentifier is then used as the primary/foreign keys in my datawarehouse.

I choose this approach to handle updates of my dimensions. I am aware that data is just replaced and that is OK. To update my dimensions I use a SQL Server merge statement which updates existing records and adds new ones.

Because hashing the same source primary key will result in the same uniqueidentifier, merge statement from staging tables to datawarehouse tables will update the dimension record. I am also independent of the size of the source primary key. Hashing will always be of a fixed size.

This design is currently in testing phase.

Users are reading the datawarehouse with Microsoft PowerBI and some were already complaining that the size of their dataset is very large. This is not a huge issue right now, but I fear that it could be an issue in the future.

Now I am questioning my Datawarehouse design.

Should I switch to a different design to reduce the size of my tables? Are there any other negative impacts using uniqueidentifiers?

Example:

Fact table PURCHASEORDERS in source system POSYSTEM:

purchaseOrderNumber ItemNumber SupplierId Created
10000 1 999 01.01.2021
10001 1 998 22.06.2021

Dimension SUPPLIER in system SUPPLIERSYSTEM:

supplierId Name Country Updated
999 Supplier A DE 01.01.2021
998 Supplier B GB 22.06.2021

ETL Step 1 (load new and updated records into staging tables):

purchaseOrderNumber ItemNumber SupplierId supplierKey Created
10001 1 998 22.06.2021
supplierKey supplierId Name Country Updated
998 Supplier B GB 22.06.2021

ETL Step 2 (calculate HashKeys)

(In this case I only have one key column. In the case If I had more, I would do a HASHBYTES('MD2' CONCAT(Column1, ';', Column2))

Fact.supplierKey = HASHBYTES('MD2', SupplierId)

Dimension.supplierKey = HASHBYTES('MD2', SupplierId)
purchaseOrderNumber ItemNumber SupplierId supplierKey
10001 1 998 64C23286-6124-9019-1B91-000081E1564B
supplierKey supplierId Name Country Updated
64C23286-6124-9019-1B91-000081E1564B 998 Supplier B GB 22.06.2021

ETL Step 3 (merge into datawarehouse tables (update/insert)):

I use update/insert because I don't need to have a history of my dimensions.

purchaseOrderNumber ItemNumber supplierKey
10000 1 64C23286-6124-9019-1B91-000081E1564A
10001 1 64C23286-6124-9019-1B91-000081E1564B
supplierKey Name Country Updated
64C23286-6124-9019-1B91-000081E1564A Supplier A DE 01.01.2021
64C23286-6124-9019-1B91-000081E1564B Supplier B GB 22.06.2021

Thank you for feedback!

Dale K
  • 25,246
  • 15
  • 42
  • 71
magges
  • 1
  • 1
  • Does this answer your question? [What are the best practices for using a GUID as a primary key, specifically regarding performance?](https://stackoverflow.com/questions/11938044/what-are-the-best-practices-for-using-a-guid-as-a-primary-key-specifically-rega) – Panagiotis Kanavos Jun 22 '21 at 07:04
  • As the duplicate says - don't use GUIDs as clustering (primary) keys. As for how you generate the keys, that's actually far worse and unfixable. With a real GUID, you could use `NEWSEQUENTIALID` at least to generate sequential values and avoid fragmentation. Calculating the key the way you did though will generate essentially random values. In the end, it's duplicating the data already found in other columns. None of the advantages you mentioned are real either. The size of the source key doesn't matter because it won't be used as a PK for a dimension in the first place. – Panagiotis Kanavos Jun 22 '21 at 07:07
  • You can't just use a MERGE statement to update dimensions either. Each type of dimension requires a different strategy. In some cases, you can just replace the existing record, erasing history. In other cases you need a *new* record. You do need to rethink the design. A very good place to start is [Dimensional Modeling Techniques](https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/) and Ralph Kimball's books in general – Panagiotis Kanavos Jun 22 '21 at 07:16
  • @PanagiotisKanavos thanks for feedback. I added an example in my question which describes the ETL Process I use. The Update strategy is not requiring keeping a history. I also cant think of generating a random ID (IDENTITY or NEWSEQUENCIALID) because I need to have the same keys in my dimensions and in my fact table. They have different source databases. – magges Jun 22 '21 at 07:50
  • IDENTITY isn't random,unlike what you use right now. It's a sequential, ever increasing value, unlike the **non**-GUID you want to use which generates values in a random sequence, resulting in fragmentation. Why *didn't* you use an `int` with an `IDENTITY` constraint from the start? There's a reason all tutorials, books and courses on data warehouses use simple keys with *no* business meaning. – Panagiotis Kanavos Jun 22 '21 at 08:06
  • Even if you want to key on a hash, don't pretend it's a `UNIQUEIDENTIFIER` -- it's not, as those have some version bits in the middle. What you've got is 16 bytes, so you may as well use a `BINARY(16)` for it. (You are also then not required to use exactly 128 bits.) Note that these have a literal format as well (`0x123456...`) so you don't have to use `UNIQUEIDENTIFIER` just to make that convenient. – Jeroen Mostert Jun 22 '21 at 08:07
  • Besides `MD2` is deprecated in all supported SQL Server versions. That means it's going to be removed – Panagiotis Kanavos Jun 22 '21 at 08:07
  • @Bill you **have** to include the source key in the dimension table anyway, as a separate field. Business users know that key and will use it to look up data in their source systems. If you include that, you won't need to use the generated key to merge records. – Panagiotis Kanavos Jun 22 '21 at 08:11
  • @PanagiotisKanavos how is the procedure of getting the generated INT IDENTITY from my dimenstion table into my fact table? Do I need keep supplierId in my dimension and then to do an INNER JOIN with supplierId and write dimension.supplierKey to fact.supplierKey – magges Jun 22 '21 at 08:46

0 Answers0