How to Find Rows which are Duplicates by a Key but Not Duplicates in All Columns?

Question

I am working with a table which is an extract of a set of other tables. All of the rows of the extract table should be unique according to keys D1, D2 and D3. They are not. It appears that an earlier developer attempted to solve this problem by using a SELECT DISTINCT across all columns being queried from this table. This will work, but only if every row which is a duplicate on (D1, D2, D3) is also a duplicate across the non-key columns (ignoring the IDENTITY column that was added to the extract table).

In other words, given rows as follows:

D1  D2  D3  C4  C5  C6
=== === === === === ===
A   B   C   X1  X2  X3
A   B   C   X1  X2  X3

then

SELECT DISTINCT D1, D2, D3, C4, C5, C6
FROM BAD_TABLE

will "work", as there's no difference between the rows which are duplicated on (D1,D2,D3). But if the table contained

D1  D2  D3  C4  C5  C6
=== === === === === ===
A   B   C   X1  X2  X3
A   B   C   X1  X2  X4

then SELECT DISTINCT would return two rows for the key (A,B,C). Furthermore, we would have to decide which of X3 or X4 was the "correct" value.

I know how to find the duplicates on (D1,D2,D3). I even know how to find the duplicates across all the columns (other than the IDENTITY column):

;
WITH DUPLICATES(D1,D2,D3) AS
(
    SELECT D1, D2, D3
    FROM SOURCE
    GROUP BY D1, D2, D3
    HAVING COUNT(*)>1
)
SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
INNER JOIN DUPLICATES D
    ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3
ORDER BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6

The question is, how do I find the subset of the above resultset which are duplicates on (D1,D2,D3), but not duplicates on (D1,D2,D3,C4,C5,C6)?

What's up with downvoting the question? How can I improve the question, if you don't say what the problem is? — John Saunders, Sep 10 '11 at 18:39
StackOverflow is becoming damaged by a small subset of high-reputation users who feel it is their right to march around downvoting without careful reading. Just my 2c. — Dan Nissenbaum, Jan 26 '13 at 07:00
How high does your rep have to be in order to downvote without comment? — John Saunders, Jan 26 '13 at 08:06
@JohnSaunders a couple of things. (1) people don't have to comment when down-voting, it has nothing to do with reputation except [the minimum reputation to down-vote at all (which only requires 125)](http://stackoverflow.com/privileges/vote-down). This isn't going to change [as you can see by this debate (and many like it)](http://meta.stackexchange.com/questions/135/encouraging-people-to-explain-down-votes). (2) If down-votes happened almost a year after you posted your question, perhaps the problem is that you haven't accepted an answer. Did none of the answers below help solve the problem? — Aaron Bertrand, Feb 01 '13 at 16:19
In other words, you asked this question over two years ago. Are you still struggling to solve the problem? — Aaron Bertrand, Feb 01 '13 at 16:20
Well, I'm not sure what question you're answering. But in any case I gave an up-vote because it is clear you put effort into the question. And I hope I have answered your other questions. — Aaron Bertrand, Feb 01 '13 at 16:25
@AaronBertrand: you've also given me a lunchtime "to do": reproduce this, test the solutions, and pick an accepted answer. — John Saunders, Feb 01 '13 at 16:39

score 4 · Answer 1 · edited May 23 '17 at 11:51

You can do it by joining the table on itself, saying the D's are all equal and at least one of the C's are not equal.

CREATE TABLE #Source (
    D1 VARCHAR(2),
    D2 VARCHAR(2),
    D3 VARCHAR(2),
    C4 VARCHAR(2),
    C5 VARCHAR(2),
    C6 VARCHAR(2) );

INSERT INTO #Source VALUES ('A', 'B', 'C', 'X1', 'X2', 'X3');
INSERT INTO #Source VALUES ('A', 'B', 'C', 'X1', 'X2', 'X4');
INSERT INTO #Source VALUES ('A', 'B', 'D', 'X1', 'X2', 'X3');
INSERT INTO #Source VALUES ('A', 'B', 'D', 'X1', 'X2', 'X3');

SELECT S1.D1, S1.D2, S1.D3, S1.C4 C4_1, S2.C4 C4_2, S1.C5 C5_1, S2.C5 C5_2, S1.C6 C6_1, S2.C6 C6_2
FROM
    #Source S1
            INNER JOIN
    #Source S2
            ON
        (       S1.D1 = S2.D1 
            AND S1.D2 = S2.D2
            AND S1.D3 = S2.D3
            AND (   S1.C4 <> S2.C4
                 OR S1.C5 <> S2.C5
                 OR S1.C6 <> S2.C6
                 )
        );

DROP TABLE #Source;

Gives the following results:

D1   D2   D3   C4_1 C4_2 C5_1 C5_2 C6_1 C6_2
---- ---- ---- ---- ---- ---- ---- ---- ----
A    B    C    X1   X1   X2   X2   X4   X3
A    B    C    X1   X1   X2   X2   X3   X4

Also note that this is compatible with MS SQL 2000 as you later indicated is required in How to Convert a SQL Query using Common Table Expressions to One Without (for SQL Server 2000).

my assignment for lunchtime is to try this out. I'm _still_ working with these same queries, though I've been permitted to do unit tests while fixing them (and TDD in fact). — John Saunders, Feb 18 '11 at 15:07

John Saunders · Answer 2 · 2010-11-22T20:27:52.963

I haven't had a chance to try Conrad's answer yet, but came up with one of my own. It's rather a "duh" moment.

So, if you want to find all the rows in set A except for those that are in set B, you use the EXCEPT operator:

; 
WITH KEYDUPLICATES(D1,D2,D3) AS 
( 
    SELECT D1, D2, D3 
    FROM SOURCE 
    GROUP BY D1, D2, D3 
    HAVING COUNT(*)>1 
),
KEYDUPLICATEROWS AS
( 
    SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    FROM SOURCE S 
    INNER JOIN KEYDUPLICATES D 
        ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3 
),
FULLDUPLICATES AS
(
    SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    FROM SOURCE S
    GROUP BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6 
    HAVING COUNT(*)>1
)
SELECT KR.D1, KR.D2, KR.D3, KR.C4, KR.C5, KR.C6
FROM KEYDUPLICATEROWS AS KR
EXCEPT
SELECT FD.D1, FD.D2, FD.D3, FD.C4, FD.C5, FD.C6
FROM FULLDUPLICATES AS FD
ORDER BY D1, D2, D3, C4, C5, C6

This seems to be showing me 1500 rows which are duplicates across (D1,D2,D3), but which are only duplicates across a subset of (D1,D2,D3,C4,C5,C6). In fact, it appears they are duplicates across (D1,D2,D3,C4,C5).

How to confirm that will be the subject of another question.

A minor correction to the code above: The last line "ORDER BY D1, D2, D3, D4, C5, C6" should read "ORDER BY D1, D2, D3, C4, C5, C6" — RC_Cleland, Nov 22 '10 at 18:58
I couldn't find it stated explicitly that if a discrepancy exists in some of the non-key columns for any group of rows with same key values, then it will surely be just one row that differs from the others (or, for example, two single rows, if there are just two of them). Is that really so? I mean, could there be a situation like `(A B C X1 Y1 Z1), (A B C X1 Y1 Z1), (A B C X1 Y1 Z2), (A B C X1 Y1 Z2)`, i.e. four rows that only differ in the value of the last column, two containing `Z1` and the other two containing `Z2`? And my question is, should such inconsistencies be taken into account too? — Andriy M, Jun 03 '11 at 14:46

score 2 · Answer 3 · answered Feb 01 '13 at 19:13

I know this is an old question, but I saw activity on the question and the technique I always use for these is not presented here as an answer, and it's really quite simple, so I figured I'd present it.

SELECT D1, D2, D3, MIN(C4), MAX(C4), MIN(C5), MAX(C5), MIN(C6), MAX(C6)
FROM BAD_TABLE
GROUP BY D1, D2, D3
HAVING MIN(C4) <> MAX(C4)
    OR MIN(C5) <> MAX(C5)
    OR MIN(C6) <> MAX(C6)

This will show all the keys of duplicates on key but with differences on non-keys, with the range of differences duplicates.

To see all the rows within that, you would need to join back to BAD_TABLE as your example in the original question.

Conrad Frix · Accepted Answer · 2010-11-22T17:52:39.887

2

Any reason you don't just create another table expression to cover more fields and join to that one?

WITH DUPLICATEKEY(D1,D2,D3) AS
(
    SELECT D1, D2, D3
    FROM SOURCE
    GROUP BY D1, D2, D3
    HAVING COUNT(*)>1
)
WITH NODUPES(D1,D2,D3,C4,C5,C6) AS
(
SELECT 
S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
GROUP BY
 S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
HAVING COUNT(*)=1
)

SELECT S.D1, S.D2, S.D3, S.C4, S.C5, S.C6
FROM SOURCE S
INNER JOIN DUPLICATEKEY D
    ON S.D1 = D.D1 AND S.D2 = D.D2 AND S.D3 = D.D3

INNER JOIN NODUPES D2
    ON S.D1 = D2.D1 AND S.D2 = D2.D2 AND S.D3 = D2.D3

ORDER BY S.D1, S.D2, S.D3, S.C4, S.C5, S.C6

edited Nov 22 '10 at 17:52

answered Nov 22 '10 at 17:44

Conrad Frix

51,984
12
96
155

You're pretty close, but I wanted to see which were "key duplicates" but **not** "full duplicates". Sorry to not know the correct term for these. – John Saunders Nov 22 '10 at 17:50
@John Right of course. I updated the answer On the second CTE having Count(*) = 1 – Conrad Frix Nov 22 '10 at 17:53

orbfish · Answer 5 · 2010-11-22T22:59:15.230

2

This would have performance limitations, but is much easier to understand:

SELECT D1, D2, D3
FROM TEST_KEY TK
WHERE (D1, D2, D3) IN
        (SELECT D1, D2, D3 FROM TEST_KEY TK2
         GROUP BY D1, D2, D3
         HAVING COUNT(*) > 1)
  AND (D1, D2, D3) IN
        (SELECT D1, D2, D3 FROM TEST_KEY TK2
         GROUP BY D1, D2, D3, C4, C5, C6
         HAVING COUNT(*) < 2)

Unable to test on SQL-Server, hope the syntax is good.

Again, not sure if you have analytic functions in SQL-Server, but this one works in Oracle and might be faster:

WITH BAD_DUP AS (
SELECT TK.*,
       COUNT(1) OVER (PARTITION BY D1, D2, D3, C4, C5, C6 ORDER BY D1) FULL_DUP,
       COUNT(1) OVER (PARTITION BY D1, D2, D3 ORDER BY D1) KEY_DUP
FROM TEST_KEY TK)
SELECT * FROM BAD_DUP
WHERE FULL_DUP < KEY_DUP

Would like to get it down to a single query....

edited Nov 22 '10 at 22:59

answered Nov 22 '10 at 22:17

orbfish

7,381
14
58
75

Thanks. I'll look into that. Yes, we have the analytic functions (COUNT, ROW_COUNT, etc). – John Saunders Nov 22 '10 at 23:06
The important thing is being able to partition by the different groupings, so that you only have to make one pass - I assume there's a way to do this, I just don't know if the syntax is standard. – orbfish Nov 23 '10 at 17:43

How to Find Rows which are Duplicates by a Key but Not Duplicates in All Columns?

5 Answers5

Linked