I am designing a datawarehouse which is fed by many source systems. My current ETL process has a step in which the primary keys of the source systems (usually multiple columns) are concatenated, hashed (via MD2) and then stored in a uniqueidentifier column.
This uniqueidentifier is then used as the primary/foreign key in my datawarehouse.
I chose this approach to handle updates of my dimensions. I am aware that data is simply replaced, and that is OK. To update my dimensions I use a SQL Server MERGE statement, which updates existing records and adds new ones.
Because hashing the same source primary key always produces the same uniqueidentifier, the MERGE statement from the staging tables into the datawarehouse tables updates the existing dimension record. I am also independent of the size of the source primary key: the hash always has a fixed size.
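To illustrate the two properties I rely on, here is a minimal T-SQL sketch (the literal values are made up for the example):

```sql
-- Same input always yields the same hash, and an MD2 digest is always
-- 16 bytes, which is exactly the size of a uniqueidentifier.
DECLARE @k1 uniqueidentifier = CONVERT(uniqueidentifier, HASHBYTES('MD2', '998'));
DECLARE @k2 uniqueidentifier = CONVERT(uniqueidentifier, HASHBYTES('MD2', '998'));
SELECT CASE WHEN @k1 = @k2 THEN 'deterministic' ELSE 'not deterministic' END;
```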
This design is currently in testing phase.
Users are reading the datawarehouse with Microsoft Power BI, and some have already complained that the size of their dataset is very large. This is not a huge issue right now, but I fear it could become one in the future.
Now I am questioning my datawarehouse design.
Should I switch to a different design to reduce the size of my tables? Are there any other negative impacts of using uniqueidentifiers?
Example:
Fact table PURCHASEORDERS in source system POSYSTEM:
| purchaseOrderNumber | ItemNumber | SupplierId | Created |
|---|---|---|---|
| 10000 | 1 | 999 | 01.01.2021 |
| 10001 | 1 | 998 | 22.06.2021 |
Dimension SUPPLIER in system SUPPLIERSYSTEM:
| supplierId | Name | Country | Updated |
|---|---|---|---|
| 999 | Supplier A | DE | 01.01.2021 |
| 998 | Supplier B | GB | 22.06.2021 |
ETL Step 1 (load new and updated records into staging tables):
| purchaseOrderNumber | ItemNumber | SupplierId | supplierKey | Created |
|---|---|---|---|---|
| 10001 | 1 | 998 | | 22.06.2021 |
| supplierKey | supplierId | Name | Country | Updated |
|---|---|---|---|---|
| | 998 | Supplier B | GB | 22.06.2021 |
ETL Step 2 (calculate hash keys):
(In this case I only have one key column. If I had more, I would do HASHBYTES('MD2', CONCAT(Column1, ';', Column2)).)
Fact.supplierKey = HASHBYTES('MD2', SupplierId)
Dimension.supplierKey = HASHBYTES('MD2', SupplierId)
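Sketched as UPDATE statements against the staging tables (my assumption of how this step could look; stg.Supplier and stg.PurchaseOrder are made-up names). Note that HASHBYTES requires a character or binary input, so an int key has to be cast first, and the 16-byte digest is then converted to a uniqueidentifier:

```sql
-- Hedged sketch: staging table names are assumptions.
UPDATE stg.Supplier
SET supplierKey = CONVERT(uniqueidentifier,
                          HASHBYTES('MD2', CAST(supplierId AS varchar(20))));

UPDATE stg.PurchaseOrder
SET supplierKey = CONVERT(uniqueidentifier,
                          HASHBYTES('MD2', CAST(SupplierId AS varchar(20))));
```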
| purchaseOrderNumber | ItemNumber | SupplierId | supplierKey |
|---|---|---|---|
| 10001 | 1 | 998 | 64C23286-6124-9019-1B91-000081E1564B |
| supplierKey | supplierId | Name | Country | Updated |
|---|---|---|---|---|
| 64C23286-6124-9019-1B91-000081E1564B | 998 | Supplier B | GB | 22.06.2021 |
ETL Step 3 (merge into datawarehouse tables (update/insert)):
I use update/insert because I don't need to have a history of my dimensions.
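A minimal sketch of the kind of MERGE I mean, for the SUPPLIER dimension (dbo.DimSupplier and stg.Supplier are assumed names for this example):

```sql
-- Hedged sketch: because identical source keys hash to identical
-- uniqueidentifiers, matching on supplierKey updates existing rows
-- and inserts new ones.
MERGE dbo.DimSupplier AS tgt
USING stg.Supplier AS src
    ON tgt.supplierKey = src.supplierKey
WHEN MATCHED THEN
    UPDATE SET tgt.Name    = src.Name,
               tgt.Country = src.Country,
               tgt.Updated = src.Updated
WHEN NOT MATCHED BY TARGET THEN
    INSERT (supplierKey, Name, Country, Updated)
    VALUES (src.supplierKey, src.Name, src.Country, src.Updated);
```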
| purchaseOrderNumber | ItemNumber | supplierKey |
|---|---|---|
| 10000 | 1 | 64C23286-6124-9019-1B91-000081E1564A |
| 10001 | 1 | 64C23286-6124-9019-1B91-000081E1564B |
| supplierKey | Name | Country | Updated |
|---|---|---|---|
| 64C23286-6124-9019-1B91-000081E1564A | Supplier A | DE | 01.01.2021 |
| 64C23286-6124-9019-1B91-000081E1564B | Supplier B | GB | 22.06.2021 |
Thank you for your feedback!