What is the fastest way to spot un-matching records between 2 tables?

Question

I am trying to spot some broken records in a MS-SQL Database.

In a simplified example, the scenerio is this:

I have 2 tables, simply put:

Table_1 : Id,Date,OpId
Table_2 : Date,OpId,EventName

And I have this business rule: If there is a record in Table_1 THEN at least 1 row should exist in the Table_2 for the Table_1.Date and Table.OpId.

If there is a row in Table_1 and if there is no row matching with that row in Table_2 THEN there is a broken data -whatever the reason-.

To find out the incorrect data, I use:

SELECT *
FROM table_1 t1
LEFT JOIN table_2 t2 ON t1.Date = t2.Date AND t1.OpId = t2.OpId
WHERE t2.OpId IS NULL -- So, if there is no 
--                           matching row in table_2 then this is a mistake

But it takes too long to have the query completed.

Can there be a faster or better way to approach similar scenerios?

score 6 · Accepted Answer · answered Jan 02 '12 at 13:21

6

To do an anti semi join NOT EXISTS in SQL Server is usually better than or equal to in performance the other options (NOT IN, OUTER JOIN ... NULL, EXCEPT)

SELECT *
FROM   table_1 t1
WHERE  NOT EXISTS (SELECT *
                   FROM   table_2 t2
                   WHERE  t1.Date = t2.Date
                          AND t1.OpId = t2.OpId)

See Left outer join vs NOT EXISTS. You may well be missing a useful index though.

answered Jan 02 '12 at 13:21

Martin Smith

438,706
87
741
845

Wow! It worked! By why is that much more faster in this query? – pencilCake Jan 02 '12 at 13:24
EXCEPT and NOT EXISTS are identical usually. And a better reference is http://explainextended.com/2009/09/15/not-in-vs-not-exists-vs-left-join-is-null-sql-server/ – gbn Jan 02 '12 at 13:25
@gbn - `EXCEPT` is only useful in the case that you don't need additional columns from the first table though so `NOT EXISTS` is a good default choice. (I did say "better than or equal to") – Martin Smith Jan 02 '12 at 13:27
@pencilCake - One other reason could be that `select *` in this query only return columns from table_1 instead of both tables as in your query. Don`t use `select *`. – Mikael Eriksson Jan 02 '12 at 13:28
It's messier than NOT EXISTS but it isn't "slower" which is the point I was making... – gbn Jan 02 '12 at 13:30
@gbn - I didn't say it was slower. [This article does](http://sqlserverpedia.com/blog/sql-server-bloggers/cage-match-i-anti-joins/) but I haven't looked into it myself. – Martin Smith Jan 02 '12 at 13:38
@pencilCake - Would need to see the plans. Could be a few reasons. You may well find the answer in either the link in my answer or gbn's link above though. – Martin Smith Jan 02 '12 at 13:40
Without seeing plans, I'm looking at the the reads which are consistent. Looks like LEFT JOIN is being optimised to a semi join there, instead if JOIN+filter (as oer Quassnoi's demo). Although, it looks like not the same semantically: http://stackoverflow.com/a/1664153/27535 – gbn Jan 02 '12 at 13:44
@gbn - Yep. Shame the OP there didn't post their table structures and population scripts so we could easily try and reproduce. – Martin Smith Jan 02 '12 at 13:47

Oleg Dok · Answer 2 · 2012-01-02T13:29:00.053

1

If you use proper indexing there is nothing to do with it (may be use NOT EXISTS instead of LEFT JOIN will be a little bit faster),

BUT

if the Table_1 is has relatively small amount of data and there is no any FKeys or other such a stuff, and this is a one time procedure, then you can use trick like this to drop incorrect lines:

SELECT table_1.*
INTO tempTable
FROM table_1 t1
WHERE EXISTS(SELECT * FROM table_1 t1 WHERE t1.Date = t2.Date AND t1.OpId = t2.OpId)

drop table Table_1

exec sp_rename 'tempTable', 'Table_1'

This may be faster

edited Jan 02 '12 at 13:29

answered Jan 02 '12 at 13:23

Oleg Dok

21,109
4
45
54

Sorry, changed erroneous table names – Oleg Dok Jan 02 '12 at 13:28
That is definetely faster Oleg; - I will check the article which Martin reccomended. I really want to understand the main reason behind this speed difference. – pencilCake Jan 02 '12 at 13:32
2

Its pretty simple - JOIN have to actually FIND all the rows in table1 matching rows in table2, but NOT EXISTS is enough to find the FIRST intersection to finally skip the row and the rest of join-like search and advance to the next row. AND when you use NOT EXISTS the optimizer chooses the best index it wants, instead of SELECT * ... LEFT JOIN, because in NOT EXISTS its doesn't matter of specific columns in table2, but in select-left join - it does - and that is limiting it or preventing to use matching indexes – Oleg Dok Jan 02 '12 at 13:35
Great explanation! Thanks! Oleg, by the can you reccomend me a TQSL book that covers those type of (advenced) details on writing TSQL queries and structering the Db in a better way? If you know any. – pencilCake Jan 02 '12 at 13:43
Tell me your Skype ID and I'll send you some books to start, to continue and to get stuck with 8-) – Oleg Dok Jan 02 '12 at 13:45

What is the fastest way to spot un-matching records between 2 tables?

2 Answers2