
Here's the modified description of my problem:

I have a situation here that I've been trying to solve but have hit a wall and would like different ideas.

Scenario: transaction data files are transferred from multiple stores to the head office (HO) as they happen. For each store the table is a decent size; at the head office, that same table contains the data from all stores, which makes it huge. The data is transferred through files because there aren't any direct connections between the HO and the stores.

I'm trying to find a way to make sure that the data from each store has made it safe and sound to the head office table. To do this I have tried a CHECKSUM_AGG of a CHECKSUM over the fields that should match between the store and the HO... but the performance was poor. I've also tried summing specific numeric columns at the store and doing the same at the HO for that store to compare, but the performance seems to be poor as well.
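
Roughly what I tried (table and column names are simplified placeholders, not the real schema):

-- one aggregate checksum per store per day, computed the same way at the store and at the HO
SELECT StoreID,
       TransactionDate,
       CHECKSUM_AGG(CHECKSUM(ColA, ColB, ColC)) AS DayChecksum
FROM dbo.Transactions
GROUP BY StoreID, TransactionDate;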

I have verified the fragmentation of the indexes and all is good.

Ideally I would love to transfer all the data from the store line by line and compare it against the HO, but that's unthinkable since it's way too big.

I'm looking for ideas that I can explore to see if performance is acceptable. The idea is to get some type of checksum of the table per store and date at the store, do the same operation at the HO, and then compare the two values. This means that I would only need to transfer one checksum value per store and date to the HO.

Any idea is appreciated.

Thank you

JohnG

1 Answer


For comparing rows, I am using HASHBYTES. For the whole table, you can use HASHBYTES over the row hashes.

You need something like this:

HASHBYTES('SHA2_512', CONCAT([ColA], [ColB], ..., [ColZ]))

A few important notes:

  1. You are using SQL Server 2014 and, as stated in the documentation, the input is limited to 8000 bytes (so MAX length types may not fit):

    For SQL Server 2014 (12.x) and earlier, allowed input values are limited to 8000 bytes.

    So you can end up with multiple hashes per row (hashing different groups of columns separately and then hashing the partial hashes - see the sketch after these notes), or you may not be able to use this technique at all if you have MAX length types.

  2. You need to decide how to handle NULL values and empty strings (if such exist).

    For example, the following hashes are the same, even though one of our imaginary columns is NULL in the first case and an empty string in the second:

    SELECT HASHBYTES('SHA2_512', CONCAT(1, '', NULL, '')); --0x4DFF4EA340F0A823F15D3F4F01AB62EAE0E5DA579CCB851F8DB9DFE84C58B2B37B89903A740E1EE172DA793A6E79D560E5F7F9BD058A12A280433ED6FA46510A
    SELECT HASHBYTES('SHA2_512', CONCAT(1, '', '', ''));  --0x4DFF4EA340F0A823F15D3F4F01AB62EAE0E5DA579CCB851F8DB9DFE84C58B2B37B89903A740E1EE172DA793A6E79D560E5F7F9BD058A12A280433ED6FA46510A
    

    Things can go crazy if you are not careful. For example, I have had a case like this (different column values, same hash):

    SELECT HASHBYTES('SHA2_512', CONCAT(1, '', NULL, 2));  --0x5AADB45520DCD8726B2822A7A78BB53D794F557199D5D4ABDEDD2C55A4BD6CA73607605C558DE3DB80C8E86C3196484566163ED1327E82E8B6757D1932113CB8
    SELECT HASHBYTES('SHA2_512', CONCAT(1, '', 2, NULL));  --0x5AADB45520DCD8726B2822A7A78BB53D794F557199D5D4ABDEDD2C55A4BD6CA73607605C558DE3DB80C8E86C3196484566163ED1327E82E8B6757D1932113CB8
    

    That's why I ended up using a separator between each value. I am using CHAR(26), but you can use any character, as long as you are sure it is not used in the data. So, the final code is:

    HASHBYTES('SHA2_512', CONCAT([ColA], CHAR(26), [ColB], CHAR(26), ..., CHAR(26), [ColZ]))
    
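For note 1, a rough sketch of that idea (the column names are just placeholders): hash groups of columns that each stay under the 8000 byte limit, then hash the concatenation of the partial hashes. HASHBYTES returns varbinary, and varbinary values can be concatenated with +:

-- each inner CONCAT must stay under 8000 bytes on SQL Server 2014;
-- the outer input is only 2 x 64 bytes, so it is always within the limit
HASHBYTES('SHA2_512',
    HASHBYTES('SHA2_512', CONCAT([ColA], CHAR(26), [ColB], CHAR(26), [ColC]))
  + HASHBYTES('SHA2_512', CONCAT([ColD], CHAR(26), [ColE])))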

Basically, you can also look for a CLR function that computes the hash of a whole result set - it will be more difficult to implement, but if you are familiar with .NET you will be OK. Check this example of creating a SQL CLR function and the official docs.
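
If CLR is not an option, here is one possible sketch, not a definitive implementation, of getting a single value per store and date out of the per-row hashes (the table dbo.Transactions and the StoreID, TransactionDate, ColA/ColB/ColC names are assumptions - replace them with your schema):

-- per-row hash as above, then an order-independent aggregate per store/date
SELECT StoreID,
       TransactionDate,
       CHECKSUM_AGG(CAST(SUBSTRING(RowHash, 1, 4) AS int)) AS DayHash, -- fold the first 4 bytes of each row hash into the aggregate
       COUNT(*) AS RowCnt                                              -- extra guard, since CHECKSUM_AGG can miss some changes
FROM (SELECT StoreID,
             TransactionDate,
             HASHBYTES('SHA2_512', CONCAT([ColA], CHAR(26), [ColB], CHAR(26), [ColC])) AS RowHash
      FROM dbo.Transactions) AS t
GROUP BY StoreID, TransactionDate;

Running the same query at the store and at the HO means only the (DayHash, RowCnt) pairs have to be transferred and compared, one per store and date.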


It might be difficult to set up, but I am also using the Data Checksum test type for creating unit tests. Basically, this type of test is built-in, so after the environment is ready you can use it.


The downside is that you need to calculate the checksum in advance (for example using your source data) and then change the SQL to query your new data. I do not think it will be very suitable for your needs.

And speaking of unit tests, you can check the tSQLt framework - I am definitely not a fan, but there might be some routine that performs a checksum on a result set.

gotqn
  • gotqn, thank you... I will definitely try this in my lab and let you know how it goes. I wasn't aware of this. – JohnG Aug 10 '18 at 18:25
  • gotqn, looks interesting, but I'm looking to aggregate one value per day per store, i.e. for a store I will get one HASHBYTES value per day, and at the HO I will do the same and get one value per day per store. I tried using it, but to get one value per date per store I need to aggregate the data within the CONCAT, and doing so just makes it too heavy. – JohnG Aug 10 '18 at 18:47
  • Not sure I understood the issue. You can add a new column to store the hash for each row - populate the hash via trigger or on insert/update; then at check time use the precalculated values and do not waste time. – gotqn Aug 10 '18 at 20:00
  • gotqn, since the table is huge and we don't have a direct connection with the stores' databases, we have to pass the data through a file... which means that I would need to export/import millions of rows into a file. What I'm trying to find is something similar to what you suggested, but done per store/date... so 1 hash per store/date which I can compare with the same store/date at the HO. I tried to hash the sum of a few columns plus a count, which worked, but I'm not sure it is better than using the CHECKSUM_AGG and CHECKSUM functions. I appreciate the time you're putting in. I modified the description. – JohnG Aug 13 '18 at 13:18
  • Using your idea, I've tried to create a persisted computed column using HASHBYTES, but with my luck it failed because the table was created with ANSI_NULLS OFF, and since this application is distributed on a large scale I can't really change that... My hands are tied and I have to find another solution. In a perfect world, I would have added the computed column to an index and aggregated it per store per date. +1 – JohnG Aug 14 '18 at 12:56
  • Unless you need to parse the concatenated string, any character may be used as the delimiter; the point is to make rows with NULL values / empty strings produce distinct values. Using 'a' (even though it appears in the data) we see: `"a", "b", "c"` concatenated is `"abc"`, with the delimiter it is `"aabac"`, while `"", "", "abc"` concatenated is `"abc"`, with the delimiter it is `"aaabc"`. The strings with 'a' as delimiter are hard to read and cannot be parsed, but they do have different hash values. – Andrew Dennison Jul 12 '20 at 16:34