SQL Find duplicate sets

Question

given this Schema:

 table tblSET 
    SetID int  PK
    SetName nvarchar(100) 


 Table tblSetItem 
    SetID int  PK
    ItemID int PK

tblSetItem.SetID is a FK into the tblSet Table.

Some data:

tblSet

 SetID    SetName
     1    Red
     2    Blue
     3    Maroon
     4    Yellow
     5    Sky

tblSetItem

 SetID    ItemID
     1       100
     1       101
     2       100
     2       108
     2       109
     3       100
     3       101
     4       101
     4       108
     4       109
     4       110
     5       100
     5       108
     5       109

I'd like a way to identify which sets contain the same items. In the example above Red and Maroon contain the same items (100,101) and Blue and Sky contain the same values (100,108,109)

Is there a sql query which would provide this answer?

I misread and was thinking he wanted something different. – Mike Cheel Nov 07 '13 at 16:08 — Mike Cheel, Nov 07 '13 at 16:08

score 2 · Accepted Answer · edited May 23 '17 at 10:32

You can use xml support to create a comma separated list (cf this answer: https://stackoverflow.com/a/1785923/215752). For this case I don't care about the form so I leave the starting comma in.

Note, I couldn't test this right now so I might have a typo...

select * from
(
  select SetID, setitemuniquestring,
         count(*) OVER (PARTITION BY setitemuniquestring) as cnt
  from
  (
    select S.SetID,
      (select ',' + I.ItemID
       from tblSetItem I
       where S.SetID = I.SetID
       order by u.ItemID ASC
       for xml path('')
      ) as setitemuniquestring
    from tblSet S
    group by S.SetID
  ) sub
) sub2
where cnt > 2

LittleSweetSeas · Answer 2 · 2013-11-07T16:08:08.523

1

I do assume that you need to identify sets which contents is exactly the same. So i would go for a temp table, where to store an "hash" of the contained item. Hash may be as simple as a list of item ids comma separated.

Eg.

Set Hash
1   100,101
2   100,108,109
3   100,101
4   101,108,109,110
5   100,108,109

Then, you simply neet to select on such a temp table grouping by hash value

Eg. Duplicate sets only:

Count Hash
2     100,101
2     100,108,109

So, resuming:

populate temp table using an xml path function to join item ids (remember to get an ordered list of item ids)
select duplicate sets on temp table by counting rows grouping by hash
apply any subsequent form of logic on your duplicate sets

edited Nov 07 '13 at 16:08

answered Nov 07 '13 at 15:46

LittleSweetSeas

6,786
2
21
26

You don't need (and shouldn't use) a cursor to populate your temp table – Hogan Nov 07 '13 at 15:50
@Hogan please explain more. I know cursor are a bad habit, but I can't see another way to do this. I would be happy to learn more from you – LittleSweetSeas Nov 07 '13 at 15:51
Awesome, you just made my day :) I just updated my answer with your hint – LittleSweetSeas Nov 07 '13 at 16:07

score 1 · Answer 3 · answered Aug 28 '19 at 17:11

A simple solution is to count the total items for each set and do a self join to count the number of items that are common between sets. Then choose only those pairs of sets where their totals and the size of the intersection are the same.

SELECT * 
  FROM (SELECT a_id, b_id, a.cnt 
          FROM (SELECT SetId as a_id, count(*) as cnt 
                  FROM tblSetItem GROUP BY SetId) as a, 
               (SELECT SetId as b_id, count(*) as cnt 
                  FROM tblSetItem GROUP BY SetId) as b 
          WHERE a.cnt=b.cnt AND a_id!=b_id) as totals_match
  NATURAL JOIN 
        (SELECT a.SetId as a_id, b.SetId as b_id, count(*) as cnt 
           FROM tblSetItem a, tblSetItem b 
          WHERE a.SetId != b.SetId and a.ItemId=b.ItemId) as items_match

score 0 · Answer 4 · answered Nov 07 '13 at 15:51

You can do this with a single query. The way I approach it is to create all pairs of sets. Then join in the items twice, once on each side, using outer joins.

Then, aggregate by the two set ids. The items are the same when there are no nulls on either side. There are different ways to check this. The following uses count(*) for that check:

select pairs.SetId1, pairs.SetId2
from (select s.SetId as SetId1, s2.SetId as SetId2
      from tblSetItem s cross join tblSetItem s2
     ) pairs left outer join
     tblSetItem si1
     on pairs.SetId1 = si1.SetId1 full outer join
     tblSetItem si2
     on pairs.SetId2 = si2.SetId2 and si2.ItemId = si1.ItemId
group by pairs.SetId1, pairs.SetId2
having count(si1.SetId) = count(si2.Setid) and
       count(si1.SetId) = count(*)

This is an interesting approach but I couldn't seem to get it to work. — Aheho, Nov 07 '13 at 17:03

SQL Find duplicate sets

4 Answers4