How to Find Duplicates Over Multiple Columns Using SQL Server

Question

Is it possible to find duplicates in a SQL table across multiple columns in a way that only requires a match by one of the columns?

For example, lets say I have a table with the following Schema: ID, C1, C2

My goal is to return a new table with a column called "Group ID" which is the ID of the group the record lives. A given record should exist in the group if there is a match by at least one of C1 or C2.

For example, consider this table

Create Table #Target

(ID Varchar(MAX),C1 Varchar(MAX),C2 Varchar(MAX))

Insert INTO #Target

Values

('1','p1','e1'),
('2','p2','e2'),
('3','p1','e2'),
('4','p3','e3'),
('5','p3','e4'),
('6','p4','e4')

ID	C1	C2
1	p1	e1
2	p2	e2
3	p1	e2
4	p3	e3
5	p3	e4
6	p4	e4

The desired output would look something like

ID	C1	C2	GID
1	p1	e1	G1
2	p2	e2	G1
3	p1	e2	G1
4	p3	e3	G2
5	p3	e4	G2
6	p4	e4	G2

When trying to come up with a solution I tried grouping by each column individually like this:

#1 Group By C1 and Assign Unique ID (C1GID) to each Group

#2 Group By C2 and Assign Unique ID (C2GID) to each Group

Now the problem I am having is how to group by C1GID OR C2GID. I don't know how to group rows that share at least one of these column values in common.

Update: I am getting closer. I am able to generate a set of IDs that have a match by at least one of the columns using this

SELECT *, CAST(NULL AS INT) AS ID_To INTO #t FROM ( VALUES ('1','p1','e1'), ('2','p2','e2'), ('3','p1','e2'), ('4','p3','e3'), ('5','p3','e4'), ('6','p4','e4') ) t (ID,C1,C2)

Select ID1, STRING_AGG(ID3, ', ') + ',' + STRING_AGG(ID2, ', ') as Groups
FROM (
    select 
    t1.ID as ID1,
    t1.C1 as t1C1,
    t1.C2 as t1C2,

    t2.ID as ID2,
    t2.C1 as t2C1,
    t2.C2 as t2C2,

    t3.ID as ID3,
    t3.C1 as t3C1,
    t3.C2 as t3C2
    from #t t1
    LEFT JOIN (
        Select * From #t
    ) t2 ON t1.C1 = t2.C1 
    LEFT JOIN (
        Select * From #t
    ) t3 ON t1.C2 = t3.C2
    WHERE t1.C1 = t2.C1 OR t1.C2 = t3.C2
) Groups 
GROUP BY ID1

Image here: Output

I am trying to find a way to re-group these by finding an intersection between the values in the Groups column. For example its clear from the result that 1,2,3 belong together, and 4,5,6 belong together because their groups column has a common ID.

The answer is Yss; however to really help you post what you have tried, expected results and a sample of data you are working with. — Mark Schultheiss, Mar 26 '23 at 00:56
While asking a question, you need to provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example): (1) DDL and sample data population, i.e. CREATE table(s) plus INSERT T-SQL statements. (2) What you need to do, i.e. logic and your code attempt implementation of it in T-SQL. (3) Desired output, based on the sample data in the #1 above. (4) Your SQL Server version (SELECT @@version;). — Yitzhak Khabinsky, Mar 26 '23 at 01:02
Thanks all, I just added to the question. Let me know if more is needed to make this pass requirements. — Riley Cohen, Mar 26 '23 at 01:25
This is known as the [graph connected components](https://en.wikipedia.org/wiki/Component_(graph_theory)) problem. You may wish to read [the paper describing SQL solution](https://arxiv.org/pdf/1802.09478.pdf) — Serg, Mar 26 '23 at 06:28
Does this answer your question? [Creating a single ID for any combination of matches between three different identifiers](https://stackoverflow.com/questions/75559657/creating-a-single-id-for-any-combination-of-matches-between-three-different-iden) — siggemannen, Mar 26 '23 at 13:35
@siggemannen I think it's really close although I can't get it to work for my use case. The solution groups all rows into one single group. I tried adjusting the conditional statement a bit but ended up with what seems to be an infinite loop. — Riley Cohen, Mar 26 '23 at 15:55
@lptr It seems your proposed solution does not compile. When I adjusted it, it appears to group by one column which is not exactly what im looking for. For example row 2 and row3 end up in different groups despite having a matching C2. — Riley Cohen, Mar 26 '23 at 15:55
`Cast(concat(c.pth,v.vrl,',') as varchar(max))`.. https://dbfiddle.uk/fP8volbL — lptr, Mar 26 '23 at 17:07
Strange, i put your exact data into the answer script and just removed Customer3 and it worked exactly as your expected result — siggemannen, Mar 26 '23 at 17:38
Perhaps your data have some anomalies as mentioned in the issue — siggemannen, Mar 26 '23 at 17:39
@siggemannen I just grabbed the code from the link you provided and that worked! I also was able to apply it to my exact dataset over a small sample to verify. One note though, the run time is very slow. It takes 23 seconds for a sample of 10 from my target set of Phone Numbers and Email Addresses. I am guessing the lengths have something to do with the slowed run time. The entire set is 4000 rows the query was executing for over 5 minutes before I cancelled the query. I'm researching to see if I can speed it up some more. — Riley Cohen, Mar 26 '23 at 17:52
No empty phone/emails. I made sure to fill them all in with the unique Id of the record. Total query time is 7min over the 4000 records. — Riley Cohen, Mar 26 '23 at 18:59
Yeah, it ain't no rocket, but not surprising perhaps, since we're basically doing a cross join x number of loops. I saw one thing though, LEFT JOIN #t tTo ON tTo.ID = t.ID_To could be removed since it doesn't affect the operation — siggemannen, Mar 26 '23 at 19:17

p3consulting · Answer 1 · 2023-03-28T05:43:39.147

Try this one:

with firstc1s(gid, id,c1) as (
    select row_number() over(order by id) as gid, id, c1 from (
        select min(id) as id, c1
        from #Target 
        group by c1
    ) x
)
,cte(lvl, gid, id, c1, c2, p) as (
    select 1, f.gid, f.id, d.c1, d.c2, concat('/', f.id, ' ')
    from firstc1s f
        join #Target d on d.id = f.id
        
    union all
    
    select lvl+1, c.gid, d.id, d.c1, d.c2, concat(c.p , '/' , d.id, ' ')
    from cte c
    join #Target d on  
        (
            c.c1 = d.c1 or c.c1 = d.c2
            or 
            c.c2 = d.c1 or c.c2 = d.c2
        )
    where charindex(concat('/' , d.id, ' '), c.p) = 0
)
select 
    dense_rank() over(order by gid) as gid, id,c1,c2 
from (
    select min(gid) as gid, id, c1, c2 from cte
    group by id, c1, c2
) x
order by gid, id, c1, c2
;

https://dbfiddle.uk/-nIlBCer

Right and easy to fix: add a space after the d.id both in the building of the path and in the charindex. — p3consulting, Mar 27 '23 at 16:23

How to Find Duplicates Over Multiple Columns Using SQL Server

1 Answers1