34

Is there a way to md5sum a row in a SQL table to check whether any column has been modified?

I would like to check whether any particular column has changed versus an old copy of the data which I hold in a text file (which I will md5sum in C#).

EDIT: Just md5sum-ing each row
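The file side of the question (md5sum-ing each row of the text copy) might look like this in Python; the question mentions C#, so this is only an illustrative equivalent with a hypothetical helper name:

```python
import hashlib

def md5_of_rows(path):
    """Return a list of (line_number, md5 hex digest), one per row of a text file."""
    digests = []
    with open(path, "rb") as f:
        for i, line in enumerate(f, start=1):
            # Strip the line terminator so the digest covers only the row data
            digests.append((i, hashlib.md5(line.rstrip(b"\r\n")).hexdigest()))
    return digests
```

Each per-row digest can then be compared against the digest computed for the corresponding database row.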

intrigued_66
  • Are you MD5ing the whole row, or just the text file? – MatBailie Jun 25 '12 at 08:59
  • The best approach for this would be to use a [`ROWVERSION` column](http://msdn.microsoft.com/en-us/library/ms182776.aspx) in your table. SQL Server will automatically update this column if any changes happen to the row. – marc_s Jun 25 '12 at 09:01
  • The table doesn't get modified, the text file does. I was hoping I could just md5sum the row and then compare it with the file row. Don't worry, I will just extract each row and md5sum the database rows manually – intrigued_66 Jun 25 '12 at 09:02
  • This is how it can be done via a select statement: `SELECT Pk1 ,ROW_NUMBER() OVER ( ORDER BY Pk1 ) 'RowNum' ,(SELECT hashbytes('md5', ( SELECT Pk1, Col2, Col3 FOR XML raw ))) 'HashCkSum' FROM [MySchema].[MyTable];` where `Pk1` is the primary key of the table and `ColX` are the columns you want to monitor for changes. This should be valid for MS SQL Server 2008/2012. – Al Dass Aug 26 '15 at 21:15

4 Answers

35

There are CHECKSUM(*), BINARY_CHECKSUM(*) and CHECKSUM_AGG. They compute a CRC32-like checksum, but for detecting changes to a row that should be more than enough (you are talking about roughly a 1 in 4 billion chance of a false-negative collision).
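To put the collision odds in perspective, here is a short Python calculation of the birthday bound for a 32-bit checksum space (an approximation of uniform random hashing, not a statement about CHECKSUM's exact distribution). Comparing one known row against its own old checksum stays near 1 in 4 billion, but hashing many distinct rows into the same 32-bit space collides much sooner:

```python
import math

def collision_probability(n_rows, bits=32):
    """Approximate birthday-bound probability that at least two of n_rows
    uniformly random hash values collide in a 2**bits space."""
    space = 2.0 ** bits
    return 1.0 - math.exp(-n_rows * (n_rows - 1) / (2.0 * space))
```

Around a hundred thousand hashed rows, the chance that *some* pair collides already exceeds 50%, even though any *specific* pair still collides with probability about 2^-32.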

Doing a cryptographic hash using HASHBYTES requires you to construct an expression representing the 'row'.

Remus Rusanu
  • Depending on the environment, 1 in 4 billion can still be frequent enough that it is likely to happen eventually. Also, in some environments, such an event could be catastrophic. I don't know the OP's needs, but doesn't even MS recommend *not* using `CHECKSUM()` to determine row changes? – MatBailie Jun 25 '12 at 09:05
  • @Dems: W/o knowing the OP's needs I can't say whether CHECKSUM is good enough. I wouldn't dismiss it, simply because the convenience of `*` is hard to match doing `HASHBYTES`. Coming up with an expression that represents the row in order to be hashed is not trivial, and even more so when you consider schema changes. – Remus Rusanu Jun 25 '12 at 09:11
  • @Dems: but I agree that 1 in 4 billion is in many cases too high a probability. Keep in mind though that since you're comparing a known row with another known row, based on the PK, there is no [meet-in-the-middle](http://en.wikipedia.org/wiki/Meet-in-the-middle_attack) issue. If false negatives result in world-end disaster then the *only* solution is to do a full byte-by-byte comparison; *all* hashes are collision prone by definition. – Remus Rusanu Jun 25 '12 at 09:17
  • Those functions are not as good as they seem in MSDN. It is easy to get a CHECKSUM collision if the inputs are similar. Take a look here: [link](http://stackoverflow.com/questions/12841331/tsql-checksum-conundrum) Also, if you are using CHECKSUM_AGG be careful - it uses XOR under the hood, so it is pretty easy to get collisions too. You can read about it here: [link](http://michaeljswart.com/2009/02/checksum_agg-a-very-nifty-function/) I would not trust those functions alone. – Kaspars Ozols Dec 28 '13 at 12:33
  • Two caveats: HASHBYTES is limited to 8000 bytes of input, and CHECKSUM() generates the same values for positive and negative decimals - i.e., if x is an int, CHECKSUM([other columns], x) and CHECKSUM([other columns], -x) will be different, but if x is a decimal they will be the same. – utexaspunk Apr 25 '19 at 13:33
  • @RemusRusanu I don't know how the algorithm works, but I suspect that there is zero chance of a collision between two values that are somewhat similar. In other words, hashing `abcd` and `abce` probably has no chance of colliding. I could be wrong about that, but I bet I'm not. And if a row has changed, it's likely that much of the row is still similar. Other considerations, such as comparing the same row to itself and not others, mean it's 1 in 4 billion *every time*. I don't think it's like flipping a quarter, where the chance of flipping the same value repeatedly continually decreases. – BVernon Sep 20 '21 at 22:34
  • @BVernon https://en.wikipedia.org/wiki/Birthday_problem ... – Remus Rusanu Sep 21 '21 at 17:20
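The XOR weakness raised in the comments above can be demonstrated outside SQL Server. This Python sketch aggregates per-row CRC32 checksums with XOR, mimicking the XOR-based aggregation that CHECKSUM_AGG is described as using (illustrative only; it is not SQL Server's actual implementation):

```python
from functools import reduce
import zlib

def xor_aggregate(rows):
    """XOR together the CRC32 of each row, an order-insensitive aggregate
    in the style attributed to CHECKSUM_AGG."""
    return reduce(lambda acc, row: acc ^ zlib.crc32(row), rows, 0)

# Any row repeated twice cancels itself out (x ^ x == 0), so two completely
# different tables of duplicated rows both aggregate to 0 -- a collision:
table_a = [b"bruno,1", b"bruno,1"]
table_b = [b"maria,2", b"maria,2"]
```

The XOR aggregate is also insensitive to row order, so reordering rows never changes it.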
19

If you have SQL Server 2008 or newer you could use:

SELECT HASHBYTES('SHA1', (SELECT TOP 1 * FROM dbo.Table FOR XML RAW))

or

SELECT  
    HASHBYTES('SHA1',(
        SELECT  * 
        FROM    dbo.myTable as  tableToHash 
        where   tableToHash.myUniqueKey=myTable.myUniqueKey 
        FOR XML RAW
    ))                                              as  rowSHA1
from    dbo.myTable;
jumxozizi
Muflix
  • That worked for me. I used a where clause to limit it to the current row – frostymarvelous Sep 15 '16 at 15:22
  • A bit late, but could you explain why the `FOR XML XXX` is required for such an operation? – Itération 122442 Mar 24 '20 at 09:36
  • @FlorianCastelain because you need to somehow get the columns into one text line; you can also use `SELECT cast(column1 as nvarchar(max)) + cast(column2 as nvarchar(max)) + ... FROM dbo.myTable as tableToHash` instead – Muflix Mar 24 '20 at 09:57
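A caveat on the cast-and-concatenate alternative in the comment above: without a delimiter, different column splits can produce the same concatenated string and therefore the same hash, which is one reason `FOR XML RAW` (which tags each column) is safer. A Python illustration, where `hash_delimited` is a hypothetical fix using length prefixes:

```python
import hashlib

def hash_concat(*cols):
    """Hash columns by naive string concatenation -- collision-prone,
    since ('ab','c') and ('a','bc') concatenate to the same string."""
    return hashlib.md5("".join(cols).encode()).hexdigest()

def hash_delimited(*cols):
    """Hash columns with unambiguous length-prefixed encoding, so the
    column boundaries are part of the hashed payload."""
    payload = "".join(f"{len(c)}:{c}" for c in cols)
    return hashlib.md5(payload.encode()).hexdigest()
```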
6

I had to develop a solution to compare table structures and run a procedure to import the difference between the tables.

I used the code below to select the data:

--> table structures

create table #table1 (
campo varchar(10)
,campo1 varchar(10)
)

create table #table2 (
campo varchar(10)
,campo1 varchar(10)
)

--> Insert values

insert into #table1 values ('bruno',1)
insert into #table1 values ('bruno',2)
insert into #table2 values ('bruno',1)
insert into #table2 values ('bruna',2)

--> Create a hash column to compare

select *,HASHBYTES('SHA1', (select z.* FOR XML RAW)) as hash
    into #compare1
 from #table1 z 

select *,HASHBYTES('SHA1', (select k.* FOR XML RAW)) as hash
    into #compare2
 from #table2 k 

--> check the lines that have any difference

select * from  #compare1 a
full outer join #compare2 b on a.hash = b.hash
where ( a.hash is null or b.hash is null )

Maybe this is useful for someone needing the same thing.
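For readers who want to verify the hash-and-full-outer-join pattern outside SQL Server, the same comparison can be sketched in Python with sets (hypothetical helper names; `repr` of the row tuple stands in for `FOR XML RAW` as the canonical row encoding):

```python
import hashlib

def row_hash(row):
    """Hash a canonical text encoding of the row tuple."""
    return hashlib.md5(repr(row).encode()).hexdigest()

def diff_tables(table1, table2):
    """Return rows whose hash appears in only one table -- the
    'full outer join ... where hash is null' pattern."""
    h1 = {row_hash(r) for r in table1}
    h2 = {row_hash(r) for r in table2}
    only_1 = [r for r in table1 if row_hash(r) not in h2]
    only_2 = [r for r in table2 if row_hash(r) not in h1]
    return only_1, only_2

# Same sample data as the temp tables above
table1 = [("bruno", 1), ("bruno", 2)]
table2 = [("bruno", 1), ("bruna", 2)]
```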

Bruno Dantas
1

Get all rows, compute the md5 of each column, and compare it with the md5 of the corresponding column in the previous data.
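A minimal Python sketch of this per-column approach (hypothetical helpers; the answer itself gives no code). Hashing each column separately also tells you *which* column changed, not just that the row did:

```python
import hashlib

def column_md5s(row):
    """md5 each column value separately, using its string form."""
    return [hashlib.md5(str(v).encode()).hexdigest() for v in row]

def changed_columns(old_row, new_row):
    """Indexes of columns whose md5 differs between the old and new copy."""
    return [i for i, (o, n) in enumerate(zip(column_md5s(old_row),
                                             column_md5s(new_row))) if o != n]
```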

aF.