SQL Deduplicate List of Tuples

Question

I have a table with two columns of IDs, like so:

╔════════╦══════╗
║ Master ║ Dupe ║
╠════════╬══════╣
║ 2      ║ 7    ║
║ 3      ║ 6    ║
║ 6      ║ 7    ║
║ 20     ║ 25   ║
║ 75     ║ 25   ║
╚════════╩══════╝

Each row represents the IDs of two rows in a sql table that are deemed to be duplicates of each other.

This table can contain many thousands of entries, with no guarantees for the data other than the Master column being sorted in ascending order as pictured. Either column could contain the same ID as the other column, potentially against different or the same partner ID. Again - no guarantees.

From this table I would like get an index of Master and all of its possible dupes. Like pictured below.

Desired outcomes:

Lowest ID should be kept as master
All subsequent dupes of a dupe should map back to the same (lowest ID) master

For the above, the desired output would look like so (but the columns don't HAVE to be sorted):

╔════════╦══════╗
║ Master ║ Dupe ║
╠════════╬══════╣
║ 2      ║ 3    ║
║ 2      ║ 6    ║
║ 2      ║ 7    ║
║ 20     ║ 25   ║
║ 20     ║ 75   ║
╚════════╩══════╝

I'm finding it difficult to explain this problem so my googling has not returned much. I'm thinking there must be an algorithm somewhere for iterating through a list of tuples like this and discovering duplication.

Any help appreciated!

EDIT: I've modified the example tables to better explain what their contents might look like.

Some notes to consider,

There is no guarantee of a chain. It could all be one big chain, lots of small ones or none at all.
There is no guarantee that all pairs appear in reverse order somewhere else in the table

From what I can see, the problem looks to be recursive, I think LukStorms is on the right track but I can't quite figure it out

ANSWERED: While both solutions below from @artm and @LukStorms seem to work, I found the latter to be a little more succinct and readable. Thank you both! Fantastic help on a tough question. I only wish I could award the answer to you both

Can you better explain your logic? What is the nature of the relationship between 2 and 3, as it appears in your result set, from the original table? — Tim Biegeleisen, Jun 13 '17 at 06:59
Sure, 3 and 6 are dupes, 6 and 7 are dupes, 7 and 2 are dupes. Keeping the lowest id of the set (2), ID's 3, 6 and 7 are all dupes of 2. — Sean Missingham, Jun 13 '17 at 07:50

artm · Answer 1 · 2017-06-14T23:41:15.877

Try this. Get the min of master from your table with a CTE and cross join to all other values in the table.

;WITH minmaster as (select MIN(MASTER) master
FROM myTable)
select distinct m.master
, i.dupe
from minmaster m 
cross join (select dupe dupe from myTable union all select master from myTable) i
WHERE i.dupe <> m.master

Update:

After your edit with more rows this below works although I'm not sure if it's the best solution. Logic was start with the first master dupe (since data is sorted by master), if the dupe exists on the 2nd column where first column is not equal to current master, then take the same master, otherwise take the next master. It's hard to explain, someone else can probably find an easier solution.

;WITH myTable AS 
(SELECT 2 MASTER, 7 dupe
UNION all SELECT 3, 6
UNION all SELECT 6, 7
UNION all SELECT 20, 25
UNION all SELECT 75, 25
UNION all SELECT 100, 125
UNION all SELECT 150, 300
UNION all SELECT 180, 300
)
, cte AS 
(
SELECT m.master L, m.dupe R, ROW_NUMBER() OVER (ORDER BY master) rnkC
FROM myTable m
)
, cte2 AS 
(
SELECT m.master L, m.dupe R, ROW_NUMBER() OVER (ORDER BY master) rnkC2
FROM myTable m
)
, cteCur AS 
(
SELECT TOP 1 cte.l, cte.R, cte.rnkC
FROM cte
UNION ALL
SELECT 
CASE WHEN cteCur.r IN (SELECT dupe 
                        FROM myTable 
                        WHERE MASTER <> cteCur.L AND dupe = cteCur.R) 
    THEN cteCur.L 
    ELSE (SELECT cte2.L 
            FROM cte2 
            WHERE cte2.rnkC2 = cteCur.rnkC + 1) 
    END
, CASE WHEN cteCur.r IN (SELECT dupe 
                            FROM myTable 
                            WHERE MASTER <> cteCur.L AND dupe = cteCur.R) 
        THEN (SELECT cte2.L 
                FROM cte2 
                WHERE cte2.R = cteCur.R AND cte2.L <> cteCur.L) 
        ELSE (SELECT cte2.R 
                FROM cte2 
                WHERE cte2.rnkC2 = cteCur.rnkC + 1) 
        END
, cteCur.rnkC + 1
FROM cteCur
WHERE cteCur.L IS NOT NULL
)
SELECT cteCur.L Master
, cteCur.R Dupe
FROM cteCur
WHERE L IS NOT NULL
ORDER BY L, R

@artm You might want to review the order by of those row_number(). When a table is used in that first CTE, then the `(select 1)` won't give the correct order, so the result could be different. — LukStorms, Jun 14 '17 at 11:11
@LukStorms You're right, thanks. I changed it to order by master. — artm, Jun 14 '17 at 23:41
Thank you both!! I've edited the original post and marked an answer` — Sean Missingham, Jun 15 '17 at 01:03

LukStorms · Accepted Answer · 2017-06-15T09:53:44.503

Here's an example that uses a recursive CTE to connect those duplicates.

But to make sure that duplicates are all in both directions, the DUPES CTE is used.

declare @DuplicateTest table (Master int, Dupe int);

insert into @DuplicateTest (Master, Dupe) values 
(3,6),(6,7),(2,7),
(20,25),(75,25);

;with DUPES as
(
     select distinct Master as Dupe1, Dupe as Dupe2 from @DuplicateTest
     union
     select distinct Dupe, Master from @DuplicateTest
)
,RCTE as
(
   select Dupe1 as Base, 0 as Level, Dupe1, Dupe2
   from DUPES

   union all

   select r.Base, (r.Level + 1), d.Dupe1, d.Dupe2
   from RCTE r
   join DUPES d on (r.Dupe2 = d.Dupe1 
                    and r.Dupe1 != d.Dupe2 -- don't loop on the reverse
                    and r.Base != d.Dupe2 -- don't repeat what we started from
                    and r.Level < 100) -- if the level gets to big it's most likely a loop
)
select min(Dupe2) as Master, Base as Dupe
from RCTE
group by Base
having Base > min(Dupe2)
order by Base;

I like where you're going with the RCTE, but the process seems to assume that the whole thing is one chain and that it's circular in that it closes with the original pair reversed. If you took out the `7,2` pair at the bottom, it no longer works whereas the result should still be the same. Please have a look at the modified examples if you would :) — Sean Missingham, Jun 14 '17 at 01:34
@SeanMissingham Indeed, there was that assumption based on the previous test data. Wouldn't see it as 'one chain' though, it's more like getting branches. The answer has been updated. — LukStorms, Jun 14 '17 at 08:34

ntg · Answer 3 · 2017-06-21T14:58:54.343

1

Coming late to the party, but what you seem to want is to find disjoined sets. If you care about efficiency, there is an extremely fast algorithm for this, and it involves the data structure called UnionFind. It seems to be faster than even sorting...

Googling for SQL implementations, I was lead there

edited Jun 21 '17 at 14:58

answered Jun 21 '17 at 14:51

ntg

12,950
7
74
95

SQL Deduplicate List of Tuples

3 Answers3