3

In Data Vault 2.0 one hashes the business key and uses this hash as the primary key of the table. Link tables also use the hash keys to create relationships.

My problem is that the hashes are essentially random, so the query optimizer cannot make any good estimates: the statistics are - of course - not usable for randomly distributed data.

So the query optimizer chooses weird plans in which it wants to sort often (because it thinks there are only 4 rows to sort). Since I am surely not the first one to deal with Data Vault on SQL Server, how is this fixable?

When the query optimizer uses an index seek or a join operator, it completely misses the row estimate and hence chooses ridiculous plans.

I have to patch the queries with join hints and query hints such as OPTION (FORCE ORDER) to get anything out of them.
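For illustration, a sketch of the kind of hinted query I mean (the table and column names below are placeholders, not my real schema):

    -- Hypothetical hub/link/satellite tables keyed on hash keys.
    -- Without hints the optimizer badly underestimates the join cardinality,
    -- so I force both the written join order and hash joins:
    SELECT s.SomeAttribute
    FROM   HubCustomer AS hc
    INNER HASH JOIN LinkCustomerOrder AS lco
           ON lco.CustomerHashKey = hc.CustomerHashKey
    INNER HASH JOIN SatOrder AS s
           ON s.OrderHashKey = lco.OrderHashKey
    OPTION (FORCE ORDER);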

What's the common approach for this?

BarneyL
tuxmania
  • It would be much appreciated if you added an example of your "mad" execution plan. I have an implementation of my own and have seen no such issues so far; in my implementation, index seeks and scans appear exactly where they should. To my mind, a "random" hash can cause issues during the insert process (due to inserts in the middle of the clustered index). It would also be great if you provided a piece of your database structure – Vladimir Semashkin May 28 '18 at 15:44
  • There are a lot of ideas in DV that simply do not work in the real world and this is one of them. – Nick.Mc Jan 22 '22 at 09:11

2 Answers

7

I firmly agree with your conclusion: hashing turns data that had structure/order to it into something totally random, which makes any form of useful database statistics impossible.

I actually did some experimenting on SQL Server myself and came to the same conclusion you did, supported by the execution plans.

That is why I firmly believe you/we should consider using the concatenated business key as a primary key INSTEAD of hashing it.
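For example, a minimal sketch of what I mean (table and column names are made up; lengths depend on your actual business keys):

    -- Hub keyed directly on the (concatenated) business key instead of a hash.
    -- The optimizer can now build meaningful statistics/histograms on the key.
    CREATE TABLE HubCustomer
    (
        CustomerNumber varchar(20)  NOT NULL,  -- natural business key
        LoadDts        datetime2    NOT NULL,
        RecordSource   varchar(50)  NOT NULL,
        CONSTRAINT PK_HubCustomer PRIMARY KEY CLUSTERED (CustomerNumber)
    );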

Arguments that are given for hashing are in the realm of:

  1. Joining on CHAR(32) (the character representation of an MD5 hash) is more performant than joining on variable-length character fields
  2. Hashing reduces hotspots in your MPP cluster when writing data

But I have yet to see proof for argument 1: as you mention, you lose any useful statistics when joining! Furthermore, a lot of natural business keys I know are actually much SMALLER than 32 characters... I actually asked a question related to this subject a few days ago...

Then to argument 2: in most MPP NoSQL databases (key-value, document, column family) the advice is to actually use a good NATURAL (concatenated) key as the sharding key, not a hash. Example: see this advice for Cassandra.

This is why I do not agree with the Data Vault 2.0 hashing theory: I have not seen any proof supporting it... It is one of the reasons why I am proposing a new Ensemble modeling approach at DMZone Berlin in October.

Glorfindel
  • 1
    If only my colleagues would get the situation. Thanks for your comment. There is also a big problem with a clustered index key on hashes, namely heavy fragmentation, which makes large inserts basically impossible. The downsides of heavy heap usage are also very clear... so thanks for your feedback again! – tuxmania Sep 16 '16 at 15:38
  • 1
    Yes, a clustered index on the hash key is a killer because of the fragmentation that occurs there. I asked Dan Linstedt about this and his recommendation was to make the entities heaps, which IMO also has a lot of problems: how are you going to selectively filter data out? – Rogier Werschkull Sep 19 '16 at 11:52
  • 2
    I can see **another option when using hashes**: use a clustered index, but put the Load_DTS column **BEFORE** the hash key (see the sketch after these comments). Normally this should not lead to page splits, as the first part of the key is ever-increasing... And more importantly: you would still have a method for leveraging clustered index seeks if you only need the _'data that is new or changed during the last x days'_ – Rogier Werschkull Sep 19 '16 at 11:57
  • Yeah, you are right, but it is a very contrived and unnatural thing. And again we get big key columns: datetime is 8 bytes and the hash is at least 16 bytes if stored as binary(16), so compared to a 4-byte identity int key, 24 bytes is 6x more, which leads to a deeper B-tree :) But I agree completely with you – tuxmania Sep 20 '16 at 11:02
  • Modern processors do not move bytes. Intel front-side buses are 64 bits wide and move 4 transfers per transaction, which is 256 bits or 32 bytes. So anything smaller than 32 bytes still involves a 32-byte transfer. On disk/SSD, most DBMSs keep housekeeping data around every field and record, so fields grow in similar multiples (perhaps 8 or 16 bytes rather than 32, though). Any evaluation of data sizes should consider these effects. The harm from hashes comes from the locality/clustering effects, not from size. – cliffordheath Nov 13 '17 at 04:05
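A minimal sketch of the clustered-index layout suggested in the comment above (Load_DTS leading, the hash key second); all names are placeholders:

    CREATE TABLE SatCustomer
    (
        CustomerHashKey binary(16)   NOT NULL,
        Load_DTS        datetime2    NOT NULL,
        CustomerName    varchar(100) NULL
    );

    -- Load_DTS first: the leading column is ever-increasing, so new rows go
    -- to the end of the index (few page splits), and queries for "data new
    -- or changed in the last x days" can still seek on the clustered index.
    CREATE CLUSTERED INDEX CIX_SatCustomer
        ON SatCustomer (Load_DTS, CustomerHashKey);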
0

Personally I would not hash the BK but would include all the fields in the HUB if it were a compound key. Why would the LINK table be using the hash values? It should be using the HUB_ID, which I would always set up as an incrementing value.
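A rough sketch of that layout (table and column names are illustrative only):

    -- Hub keyed on an incrementing surrogate; the natural (compound) BK is
    -- stored and kept unique, and the LINK references HUB_ID, not a hash.
    CREATE TABLE HUB_Customer
    (
        HUB_Customer_ID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        CustomerNumber  varchar(20) NOT NULL,
        CompanyCode     varchar(10) NOT NULL,
        CONSTRAINT UQ_HUB_Customer_BK UNIQUE (CustomerNumber, CompanyCode)
    );

    CREATE TABLE LNK_Customer_Order
    (
        LNK_Customer_Order_ID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        HUB_Customer_ID int NOT NULL REFERENCES HUB_Customer (HUB_Customer_ID),
        HUB_Order_ID    int NOT NULL  -- FK to HUB_Order, omitted here
    );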

I would, however, create and store a HASH in the SAT table, as this is the fastest way to check for changes in the ETL process: hash the values coming in and compare them to the hash on the current SAT record - no point in recomputing the hash on the existing record.
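For example, a sketch of that comparison during the load, assuming a Row_Hash column and a staging table (all names here are placeholders):

    -- Staging: compute the hash once over the concatenated attribute values.
    UPDATE stg_Customer
    SET    Row_Hash = HASHBYTES('MD5', CONCAT(CustomerName, '|', City));

    -- Load: insert a new SAT row only when something actually changed.
    INSERT INTO SAT_Customer (HUB_Customer_ID, Load_DTS, CustomerName, City, Row_Hash)
    SELECT stg.HUB_Customer_ID, SYSUTCDATETIME(), stg.CustomerName, stg.City, stg.Row_Hash
    FROM   stg_Customer AS stg
    LEFT JOIN SAT_Customer AS sat
           ON  sat.HUB_Customer_ID = stg.HUB_Customer_ID
           AND sat.Is_Current = 1            -- however "current" is flagged
    WHERE  sat.HUB_Customer_ID IS NULL       -- brand-new entity
       OR  sat.Row_Hash <> stg.Row_Hash;     -- attribute values changed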

Aaron Reese