Data discrepancies/comparison between three tables - SQL Server 2014

Question

I have three tables in SQL Server 2014, each table has millions of data and keep growing. I am trying to find discrepancies between the tables, for example:

DECLARE @ab TABLE
(
    k1 int,
    k2 int,
    val char(1)
)

DECLARE @cd TABLE
(
    k1 int,
    k2 int,
    val char(1),
    add_cd varchar(50)
)

DECLARE @ef TABLE
(
    k1 int,
    k2 int,
    val char(1),
    add_ef varchar(50)
)

INSERT INTO @ab VALUES(1,1,'a'), (2, 2, 'c'), (3, 3, 'c'), (4, 4, 'd'), (5, 5, NULL), (7, 7, 'g')

INSERT INTO @cd VALUES(1,1,'a', 'DSFS'), (2, 2, 'b', 'ASDF'), (4, 4, NULL, 'SDFE')

INSERT INTO @ef VALUES(1,1,'a', 'SD1245'), (2, 2, 'b', 'EW3464'), (3, 3, 'd', 'DF3452'),(4, 4, 'd', 'FG4576'), (6, 6, 'e', 'RT3453')

The common Key columns for all three sets are k1 and k2, I would like pull only the discrepancies either the value of "val" should be different or the key combination shouldnt exist in all three sets. No need to compare the additional columns(add_cd and add_ef) just needed in the final result. The desired result is:

k1   K2   val   k1    k2    val  add_cd  k1   k2    val   add_ef
2    2    c     2     2     b    ASDF    2    2     b     EW3464 
3    3    c     NULL  NULL  NULL NULL    3    3     d     DF3452
4    4    d     4     4     NULL SDFE    4    4     d     FG4576
5    5    NULL  NULL  NULL  NULL NULL    NULL NULL  NULL  NULL
NULL NULL NULL  NULL  NULL  NULL NULL    6    6     e     RT3453
7    7    g     NULL  NULL  NULL NULL    NULL NULL  NULL  NULL

I tried the below query, it gives desired result but works only with few thousands not with millions. Created indexes for the key columns but sill I see it uses table scan. Can anyone please advise on this?

SELECT a.*, c.*, e.*
FROM @ab a
FULL OUTER JOIN @cd c   ON  a.k1 = c.k1
                        AND a.k2 = c.k2
FULL OUTER JOIN @ef e   ON  (c.k1 = e.k1
                        AND c.k2 = e.k2 ) 
                        OR (a.k1 = e.k1
                        AND a.k2 = e.k2 )       
WHERE   (a.k1 IS NULL OR c.k1 IS NULL OR e.k1 IS NULL)
OR      (ISNULL(a.val, '') != ISNULL(c.val, ''))
OR      (ISNULL(c.val, '') != ISNULL(e.val, ''))
OR      (ISNULL(a.val, '') != ISNULL(e.val, ''))

Possible duplicate of [How to select rows with no matching entry in another table?](https://stackoverflow.com/questions/4076098/how-to-select-rows-with-no-matching-entry-in-another-table) — Tab Alleman, Mar 14 '18 at 19:05
I tired full outer join with not equal conditions but it is running for ever so looking for better solution — Saba, Mar 14 '18 at 19:06
In the question you say you are not worried about performance. Post what you tried so we can debug it. — Tab Alleman, Mar 14 '18 at 19:07
Edited my question and posted what I tried. Removed the comment about the performance, sorry for the confusion. — Saba, Mar 14 '18 at 19:32
@TabAlleman I tried with SSIS as well but MergeJoin expects sorted dataset and doesnt support to use OR condition. — Saba, Mar 14 '18 at 19:41

Tab Alleman · Accepted Answer · 2018-03-14T19:53:06.110

Your existing query is the right approach. There are some small changes you can make to improve it. Your index for each table should be on k1, k2, val:

EDIT (my original NULL handling was not correct. The correct approach appears long-winded, but is probably the most efficient solution that is logically correct):

SELECT a.*, c.*, e.*
FROM @ab a
FULL OUTER JOIN @cd c   ON  a.k1 = c.k1
                        AND a.k2 = c.k2
FULL OUTER JOIN @ef e   ON  (c.k1 = e.k1
                        AND c.k2 = e.k2 ) 
                        --OR (a.k1 = e.k1    --This condition is not needed and will only slow performance
                        --AND a.k2 = e.k2 )       
WHERE   (a.k1 IS NULL OR c.k1 IS NULL OR e.k1 IS NULL)
--OR      (ISNULL(a.val, '') != ISNULL(c.val, ''))     --Wrapping the val columns in ISNULL prevents the indexes from being used
--OR      (ISNULL(c.val, '') != ISNULL(e.val, ''))
--OR      (ISNULL(a.val, '') != ISNULL(e.val, ''))
OR      ((a.val != c.val) OR (a.val IS NULL AND c.val IS NOT NULL) OR (a.val IS NOT NULL AND c.val IS NULL))
OR      ((a.val != e.val) OR (a.val IS NULL AND e.val IS NOT NULL) OR (a.val IS NOT NULL AND e.val IS NULL))
OR      ((e.val != c.val) OR (e.val IS NULL AND c.val IS NOT NULL) OR (e.val IS NOT NULL AND c.val IS NULL))

When you need to compare nullable columns, it may feel more elegant to compare ISNULL() results, but inline functions prevent the query engine from using indexes, forcing table scans, which is the worst thing you can do for performance.

Thank you for your valuable suggestions, removing the ISNULLs. If I remove "OR (a.k1 = e.k1 AND a.k2 = e.k2 )" then table ef (3, 3) is not aligned properly with ab (3, 3), getting two different rows for 3, 3. — Saba, Mar 14 '18 at 20:00

score 0 · Answer 2 · answered Mar 14 '18 at 19:30

0

Would something like this work for you?

SELECT Z.k1, Z.k2, Z.val, Y.k1, Y.k2, Y.val, Y.add_cd, X.k1, X.k2, X.val, X.add_ef
FROM @ab AS Z 
FULL OUTER JOIN @cd AS Y ON Z.k1 = Y.k1 AND Z.k2 = Y.k2
FULL OUTER JOIN @ef AS X ON X.k1 = Y.k1 AND X.k2 = Y.k2
WHERE NOT EXISTS (
    SELECT A.k1, A.k2, A.val, C.k1, C.k2, C.val, C.add_cd, E.k1, E.k2, E.val, E.add_ef
    FROM @ab AS A
    INNER JOIN @cd AS C ON A.k1 = C.k1 AND A.k2 = C.k2 AND A.val = C.val
    INNER JOIN @ef AS E ON C.k1 = E.k1 AND C.k2 = E.k2 AND C.val = E.val
    WHERE Z.k1 = A.k1 AND Z.k2 = A.k2 AND Y.k1 = C.k1 AND Y.k2 = C.k2 AND X.k1 = E.k1 AND X.k2 = E.k2
)

I'm worried there may be nuances with your NULLS that are or are not compared the way you want them to be...

answered Mar 14 '18 at 19:30

Sturgus

666
4
18

Thank you for your quick response. Please see the query I tried in the question that gives the result for small amount of data but takes forever for the millions. Can you please help me with the performance? – Saba Mar 14 '18 at 19:38
Well, I see this query performing better than your version. FULL OUTER JOINs on millions of records are always going to be slow though. Try comparing the speed of mine vs yours and reporting back. – Sturgus Mar 14 '18 at 19:40
Something in the joins is off - the `k1` \ `k2` values of 3\3 (on tables `@ab` and `@ef`) are showing on 2 different records in the output of this query – tarheel Mar 14 '18 at 19:41
Yes, I see what you mean now. It's showing ab's 3, 3, c on a different row than ef's 3, 3, d. – Sturgus Mar 14 '18 at 19:52

score 0 · Answer 3 · answered Mar 14 '18 at 19:38

I think you're going down the right path with using a full outer join, just need to make the where clause work for ya. Might not be the most efficient answer, but will do the trick.

select *
from @ab as ab
full outer join @cd as cd on ab.k1 = cd.k1
                         and ab.k2 = cd.k2
full outer join @ef as ef on ab.k1 = ef.k1
                         and ab.k2 = ef.k2
where (
        isnull(ab.val, 'X') <> isnull(cd.val, 'XX')
        or
        isnull(ab.val, 'X') <> isnull(ef.val, 'XX')
        or
        isnull(cd.val, 'X') <> isnull(ef.val, 'XX')
        or
        coalesce(ab.val, cd.val, ef.val) is NULL
    )
order by coalesce(ab.k1, cd.k1, ef.k1)
, coalesce(ab.k2, cd.k2, ef.k2)

The parenthesis are around the entire where clause just in case you ever add another constraint (don't want the compiler confusing and/or because of syntax). And the order by clause is only to help match the order of the expected output shown in the question.

Data discrepancies/comparison between three tables - SQL Server 2014

3 Answers3