
I have a set of stored procedures. Each stored procedure supposedly keeps a specific database table in sync with an identical one in another database.

The database tables have up to hundreds of millions of records. I need to find the quickest way to validate that these procedures are really keeping everything in sync, and I need to be able to locate records which vary between the two tables for each procedure (for debugging purposes).

I was told that the following approach (found somewhere on SO, I believe, but I don't have the link as it was a while back) wouldn't work fast enough:

Insert into target_table(columns)
select columns from table1
except
select columns from table2

Insert into target_table(columns)
select columns from table2
except
select columns from table1

Can anyone suggest another way to do this that would be faster, either using T-SQL procedures or even external C# code? (I thought C# code might let me store PKs for hashing purposes, so I could at least track the primary keys and find which were superfluous/missing even if I didn't track the rest of the fields.)

John Humphreys
  • Can you change the schema to add an additional checksum type column to both? – Martin Smith May 29 '12 at 13:00
  • Possibly - what would you base the checksum on? And what would you put in the after trigger? – John Humphreys May 29 '12 at 13:03
  • @w00te - The idea being to use some sort of hashing function across all columns in the row to generate a value that could then be used in a `FULL OUTER JOIN T2 ON T1.PK = T2.PK WHERE EXISTS (T1.HASH EXCEPT T2.HASH)`, but this would need to be indexed and there is always a risk of collisions (a rough sketch of this idea appears after these comments). – Martin Smith May 29 '12 at 13:15
  • Sounds good enough to me, throw it up as an answer :) – John Humphreys May 29 '12 at 13:18
  • I'd rather not, as it's not something that I've implemented myself before and it's all a bit hand-wavy and vague at the moment. If you decide to go that route and it works out OK, maybe self-answer with more specifics once you have it implemented. – Martin Smith May 29 '12 at 13:22
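
For what it's worth, a minimal sketch of the hash-column idea from these comments might look like the following. The table and column names (source_table, target_table, pk, col1, col2) are placeholders, and CHECKSUM is only one candidate hash; collisions are possible, so any mismatch reported this way still needs to be confirmed against the actual column values.

-- Hypothetical sketch only: add a persisted checksum column to both tables.
alter table source_table add row_chk as checksum(col1, col2) persisted;
alter table target_table add row_chk as checksum(col1, col2) persisted;

-- Cover the key plus the checksum so the comparison can run off the indexes.
create index ix_source_row_chk on source_table (pk) include (row_chk);
create index ix_target_row_chk on target_table (pk) include (row_chk);

-- Rows missing on either side, or present on both sides with different checksums.
-- EXCEPT inside EXISTS is just a NULL-safe "not equal" test, as in the comment above.
select s.pk as source_pk, t.pk as target_pk
from source_table s
full outer join target_table t on s.pk = t.pk
where s.pk is null
   or t.pk is null
   or exists (select s.row_chk except select t.row_chk);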

2 Answers


It is fairly difficult to do this, but you can get some mileage out of checksums. One approach is to split the key range into several subranges that can be verified (a) in parallel and/or (b) at different scheduled intervals. E.g.:

use master;
go

set nocount on;
go

if db_id('test') is not null
begin
    alter database test set single_user with rollback immediate;
    drop database test;
end
go

create database test;
go

use test;
go

create table data (id int identity(1,1) not null primary key, 
    data1 varchar(38),
    data2 bigint,
    created_at datetime not null default getdate());
go  

declare @i int = 0;
begin transaction   
while @i < 1000000
begin
    insert into data (data1, data2) values (newid(), @i);
    set @i += 1;
    if @i % 1000 = 0
    begin
        commit;
        raiserror (N'Inserted %d', 0, 0, @i);
        begin tran;
    end
end
commit  
raiserror (N'Inserted %d', 0, 0, @i);
go

backup database test to disk='c:\temp\test.bak' with init;
go

if db_id('copy') is not null
begin
    alter database copy set single_user with rollback immediate;
    drop database copy;
end
go

restore database copy from disk='c:\temp\test.bak'
with move 'test' to 'c:\temp\copy.mdf', move 'test_log' to 'c:\temp\copy_log.ldf';
go

-- create some differences
--
update test..data set data1 = newid() where id = cast(rand()*1000000 as int)
update copy..data set data1 = newid() where id = cast(rand()*1000000 as int)

delete from test..data where id = cast(rand()*1000000 as int);
insert into copy..data (data1, data2) values (newid(), -1);


-- do the check
--
declare @id int = 0;
while @id < 1010000
begin
    declare @chk1 int, @chk2 int;
    select @chk1 = checksum_agg(binary_checksum(*)) from test..data where id >= @id and id < @id + 10000
    select @chk2 = checksum_agg(binary_checksum(*)) from copy..data where id >= @id and id < @id + 10000
    -- EXCEPT is NULL-safe: a range that is empty on one side still registers as a difference
    if exists (select @chk1 except select @chk2)
    begin
        -- locate the different row(s)
        --
        select t.id, binary_checksum(*) as chk
            from test..data t
            where t.id >= @id and t.id < @id + 10000
        except
        select id, binary_checksum(*) as chk
            from copy..data c
            where c.id >= @id and c.id < @id + 10000;

        select c.id, binary_checksum(*) as chk
            from copy..data c
            where c.id >= @id and c.id < @id + 10000
        except
        select id, binary_checksum(*) as chk
            from test..data t
            where t.id >= @id and t.id < @id + 10000;
    end
    else
    begin
        raiserror (N'Range %d is OK', 0,0, @id);
    end
    set @id += 10000;
end

The main issue is that identifying the differences can only be achieved by scanning all the rows, which is very expensive. Using ranges, you can submit different ranges to be verified on a rotating schedule (a sketch of one way to drive this follows below). The usual CHECKSUM_AGG and BINARY_CHECKSUM(*) restrictions apply, of course:

BINARY_CHECKSUM ignores columns of noncomparable data types in its computation. Noncomparable data types include text, ntext, image, cursor, xml, and noncomparable common language runtime (CLR) user-defined types.
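
To make the rotating-schedule idea concrete, here is a minimal sketch under stated assumptions: the sync_check_range table and its column names are invented for illustration, and the ranges match the 10,000-row blocks used in the script above. Each run (for example from a SQL Agent job) verifies the range that has gone longest without a check.

-- Tracking table for key ranges (illustrative names, not part of the answer above).
create table sync_check_range (
    range_start  int not null primary key,
    range_end    int not null,
    last_checked datetime null,
    last_result  bit null);   -- 1 = checksums matched, 0 = mismatch found

-- Seed one row per 10,000-id range, same width as the loop above.
;with n as (select 0 as i union all select i + 1 from n where i < 100)
insert into sync_check_range (range_start, range_end)
select i * 10000, (i + 1) * 10000 from n
option (maxrecursion 200);

-- Each run: pick the range that was checked longest ago and compare checksums.
declare @start int, @end int, @chk1 int, @chk2 int;
select top (1) @start = range_start, @end = range_end
from sync_check_range
order by isnull(last_checked, '19000101');

select @chk1 = checksum_agg(binary_checksum(*)) from test..data where id >= @start and id < @end;
select @chk2 = checksum_agg(binary_checksum(*)) from copy..data where id >= @start and id < @end;

update sync_check_range
set last_checked = getdate(),
    last_result  = case when exists (select @chk1 except select @chk2) then 0 else 1 end
where range_start = @start;

-- Ranges that most recently failed the comparison (candidates for the EXCEPT drill-down above).
select range_start, range_end, last_checked from sync_check_range where last_result = 0;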

Remus Rusanu

These are two queries that I use for that purpose.

Table Checksum

Select CheckSum_Agg(Binary_CheckSum(*)) From Table With (NOLOCK)

Row Checksum

Select CheckSum_Agg(Binary_CheckSum(*)) From Table With (NOLOCK) Where Column = Value
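
A quick sketch of how these might be combined to compare the two databases from the question; the database and table names (SourceDb, TargetDb, MyTable) are placeholders, not from the original answer:

Declare @src int, @dst int;
Select @src = CheckSum_Agg(Binary_CheckSum(*)) From SourceDb.dbo.MyTable With (NOLOCK);
Select @dst = CheckSum_Agg(Binary_CheckSum(*)) From TargetDb.dbo.MyTable With (NOLOCK);

If Exists (Select @src Except Select @dst)   -- NULL-safe "not equal"
    Print 'Tables differ - drill down by key range or per-row checksum';
Else
    Print 'Checksums match (collisions are still possible)';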

Credit goes to Hidden Features of SQL Server.

buckley