
Context

Tables range from 1M to 200M records. None of the tables in question have any indexes, as they are staging/landing areas for our raw data sources. The purpose of the cast(NULL as varchar(max)) below was to handle one situation where the HASHBYTES input had to cover a combination of varchar(255) columns that totaled more than 8,000 bytes, so a developer applied it across our entire loading process.
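
For context on why that cast was thought to matter: CONCAT returns a non-MAX string type unless at least one argument is a MAX type, so without such an argument the concatenated value is silently truncated at 8,000 bytes before HASHBYTES ever sees it (and, on versions before SQL Server 2016, HASHBYTES itself rejected inputs over 8,000 bytes). A minimal sketch, illustrative only and not our production code:

-- Illustration: CONCAT truncates at 8,000 bytes unless a MAX-typed
-- argument promotes the result to varchar(max).
DECLARE @a varchar(8000) = REPLICATE('x', 8000);
DECLARE @b varchar(8000) = REPLICATE('y', 8000);

SELECT
    DATALENGTH(CONCAT(@a, '|', @b))                             AS WithoutMaxArg, -- 8000 (truncated)
    DATALENGTH(CONCAT(@a, '|', @b, CAST(NULL AS varchar(max)))) AS WithMaxArg;    -- 16001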

The HASHBYTES value is used to identify when a record changes for incremental loads.

I attempted to reproduce this with small temp tables, but could not.
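
For anyone who wants to poke at it, that attempt looked roughly like the following (names and data are simplified; at this size the plan shape did not change for me):

-- Simplified repro skeleton (illustrative names and throwaway data).
CREATE TABLE #Land
(
    TransactionDetailTypeCode   varchar(50),
    TransactionDetailAmount     decimal(19,4),
    OriginalTransactionPostDate date,
    SourceSystemCode            varchar(50),
    EtlSourceTransaction        varchar(500),
    EtlSourceTransactionDetail  varchar(500)
);

INSERT INTO #Land
SELECT TOP (100000)
       'TYPE' + CAST(ABS(CHECKSUM(NEWID())) % 10 AS varchar(10)),
       ABS(CHECKSUM(NEWID())) % 1000000 / 100.0,
       DATEADD(DAY, -(ABS(CHECKSUM(NEWID())) % 365), CAST(GETDATE() AS date)),
       'SRC01',
       REPLICATE('a', 100),
       REPLICATE('b', 100)
FROM sys.all_objects AS o1
CROSS JOIN sys.all_objects AS o2;

-- Compare the plan for this statement against the same statement with the
-- trailing CAST(NULL AS varchar(max)) removed.
SELECT CAST(HASHBYTES('SHA2_256', CONCAT(TransactionDetailTypeCode, '|', TransactionDetailAmount, '|',
            OriginalTransactionPostDate, '|', SourceSystemCode, '|', EtlSourceTransaction, '|',
            EtlSourceTransactionDetail, '|', CAST(NULL AS varchar(max)))) AS varbinary(32)) AS EtlRowHashValue
INTO #Hashed
FROM #Land;

DROP TABLE #Hashed, #Land;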

Code in Question

CAST (
        HASHBYTES
        (
            'SHA2_256',
            CONCAT
                (
                    [CTE].[TransactionDetailTypeCode], --varchar(50)
                    '|',
                    [CTE].[TransactionDetailAmount], --decimal(19,4)
                    '|',
                    [CTE].[OriginalTransactionPostDate], --date
                    '|',
                    [CTE].[SourceSystemCode], --varchar(50)
                    '|',
                    [CTE].[EtlSourceTransaction], --varchar(500)
                    '|',
                    [CTE].[EtlSourceTransactionDetail], --varchar(500)
                    '|',
                    CAST(NULL AS VARCHAR(MAX)) --COMMENT THIS LINE OUT
                )
        )
        AS varbinary(32)
    ) AS [EtlRowHashValue]
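
As a quick sanity check (not part of the original load code), you can ask the engine what data type CONCAT produces with and without that last argument; the varchar(max) argument promotes the whole expression to varchar(max):

-- Describe the result-set types for the two CONCAT variants.
-- Expected: a finite varchar(n) for WithoutMaxArg, varchar(max) for WithMaxArg.
SELECT name, system_type_name
FROM sys.dm_exec_describe_first_result_set(N'
    SELECT CONCAT(CAST(''a'' AS varchar(50)), ''|'', CAST(1 AS decimal(19,4)))                                    AS WithoutMaxArg,
           CONCAT(CAST(''a'' AS varchar(50)), ''|'', CAST(1 AS decimal(19,4)), ''|'', CAST(NULL AS varchar(max))) AS WithMaxArg;',
    NULL, 0);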

Query Plans

  • Query plan with the varchar(max) line included: [QueryPlanNoCommentOut]

  • Query plan with the varchar(max) line commented out: [QueryPlanCommentedOutLine]

Discussion

My finding is that the estimated subtree cost with the code snippet included is over 22 times higher than with it commented out. I also noticed it forces a Sort for the Merge Join, which I suspect is the cause of the abysmal performance. I'm assuming there's no easy answer for this one, but out of curiosity:

Can someone please explain whether it's the use of HASHBYTES, the CONCAT, or something else that causes including cast(NULL as varchar(max)) to change the query plan this drastically?

Is this caused by some under-the-hood behavior that I'm simply not aware of?

Adam
  • Because it's `varchar(max)`, a type meant to store 2GB-sized blobs of text. Don't use `varchar(max)` unless you really want to store huge files – Panagiotis Kanavos Aug 21 '23 at 14:42
  • Why are you implicitly converting your entire string into a `varchar(MAX)` (thereby stating that the concatenated value will be *larger* than 8,000 characters, even though your definitions state otherwise) and then only getting the first 32 bytes from `HASHBYTES`, would be my question. Even if we allow 20 characters for `OriginalTransactionPostDate`, your string would be at most 1,146 characters; that's *far* less than 8,000 – Thom A Aug 21 '23 at 14:43
  • You don't need hashing for incremental loads. All versions and editions of SQL Server offer [change tracking](https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-tracking-sql-server?view=sql-server-ver16), a cheap way to retrieve the PK values of changed (added/modified/deleted) rows since the last sync, along with the change reason. You can't find deleted rows with hashing – Panagiotis Kanavos Aug 21 '23 at 14:44
  • @ThomA, this was done in one use case by another developer to handle the situation where the columns in the hash DID total more than 8,000 characters. It was then swept throughout all our loading processes "in case" it ever happened again. Rest assured, I ripped that line out across the board and am here to understand whether it was the varchar(max) itself, or what one of the functions was doing behind the scenes. – Adam Aug 21 '23 at 14:46
  • In any case, generating the hash should be done when the data is stored, not when it's queried. You can create a persisted computed column with the hash expression so the server *doesn't* have to keep calculating the hash value (a rough sketch of that is below these comments). That's still a whole lot slower than using change tracking or the more expensive but more resilient [Change Data Capture](https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-ver16) – Panagiotis Kanavos Aug 21 '23 at 14:46
  • I don't think the extra column itself is the problem, since the HASHBYTES output is of a stable type. What I saw is that adding varchar(max) adds an extra Compute Scalar that upscales all the arguments to CONCAT to MAX. Instead it's probably the standard issue: when you change a query you get a new plan, which might not be good (or it might be). – siggemannen Aug 21 '23 at 14:53
  • @PanagiotisKanavos I appreciate the best-practices advice, but our current tech stack for this ODS is LAND, DELTA (changes), and TARGET. The cloud warehouse then uses CDC on TARGET to pick up changes which are tracked up top. CT might be a great tool, but I don't have control over enterprise decisions, which is why I'm here to understand this specific case. Thanks. – Adam Aug 21 '23 at 14:56
  • CDC doesn't need hashing either. You can't just cover up a bad practice with some extra code. Calculating 1M hashes, much less 200M, each time is extremely expensive. Even worse if you have to perform 1Mx1M joins *and* compare hashes to find the differences. You can limit the cost by using a persisted computed column – Panagiotis Kanavos Aug 21 '23 at 14:58
  • Hashing is needed when the external data is duplicated across loads without a practical way of identifying individual rows or new data. If, for example, you retrieve snapshot files with a changing order across snapshots, you may need a hash to check whether a row was encountered in the past. If the rows had a unique key, or even if the order was maintained from one file to the next, you wouldn't need a hash. You could use the key, or even the line number, to identify new rows – Panagiotis Kanavos Aug 21 '23 at 15:03
  • Did you compare memory grant information between the two plans? SQL Server makes decisions about query plan strategy at least partially based on how much information it may need to store in memory. When you add a varchar(max) value to every row, I'll let you guess what happens to the guesses about memory. :-) – Stuck at 1337 Aug 21 '23 at 15:14
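
A rough sketch of the persisted computed column approach suggested in the comments (illustrative DDL, not the asker's actual tables; explicit, deterministic conversions are used so the column can be PERSISTED, which also means the exact string being hashed differs from CONCAT's implicit conversions in the original expression):

-- Sketch: persist the row hash so it is computed once at write time
-- rather than on every comparison query.
CREATE TABLE dbo.LandingTransactionDetail
(
    TransactionDetailTypeCode   varchar(50),
    TransactionDetailAmount     decimal(19,4),
    OriginalTransactionPostDate date,
    SourceSystemCode            varchar(50),
    EtlSourceTransaction        varchar(500),
    EtlSourceTransactionDetail  varchar(500),
    EtlRowHashValue AS CAST(HASHBYTES('SHA2_256',
        CONCAT(TransactionDetailTypeCode, '|',
               CONVERT(varchar(25), TransactionDetailAmount), '|',
               CONVERT(char(8), OriginalTransactionPostDate, 112), '|', -- style 112 is deterministic
               SourceSystemCode, '|',
               EtlSourceTransaction, '|',
               EtlSourceTransactionDetail)) AS varbinary(32)) PERSISTED
);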

0 Answers