0

Question: How do I efficiently get a list of people (#People table below, one person per row) that are matched as a set (#Keys table below), but also make sure that the sets are unique.

Background: I'm working with sets of matches in a database (in the form of KeyId, PersonId1, PersonId2). We have an automatic method that flags people as duplicates and writes them to a Match table. There aren't too many, but we usually sit with about 100K match records and 200K people. The match process has also added duplicate records in the form of matching Person 1 with 2, and also matching 2 with 1. Also, we logically delete people records (IsDeleted = 1) so do not want to return matches where one person has already been deleted.

We have an administration screen where people can look at the duplicates and flag whether they aren't dupes, or delete one of the pair. There has been a problem where even if one person in the pair was deleted, the other one was still showing in the list. The SQL below is an attempt to make sure that only people that exist as a set are returned.

Test Data Setup:

CREATE TABLE #Keys
(
    KeyId int PRIMARY KEY,
    PersonId1 int,
    PersonId2 int
)

CREATE TABLE #People
(
    PersonId int PRIMARY KEY,
    Name varchar(150),
    IsDeleted bit
)

INSERT INTO #People
VALUES  (1, 'John',0),
        (2, 'Madeline',0),
        (3, 'Ralph',1),
        (4, 'Sarah',0),
        (5, 'Jack',0),
        (6, 'Olivia',0),
        (7, 'Ethan',0),
        (8, 'Sophia',0)

INSERT INTO #Keys
VALUES  (1,1,2),
        (2,2,3),
        (3,1,3),
        (4,2,1),
        (5,4,8),
        (6,3,7),
        (7,6,1)

SELECT *
FROM #Keys k
 JOIN #People p1
    ON k.PersonId1 = p1.PersonId
    AND p1.IsDeleted = 0
 JOIN #People p2
    ON k.PersonId2 = p2.PersonId
    AND p2.IsDeleted = 0

Returns:

KeyId   PersonId1   PersonId2   PersonId    Name    IsDeleted   PersonId    Name    IsDeleted
1   1   2   1   John    0   2   Madeline    0
4   2   1   2   Madeline    0   1   John    0
5   4   8   4   Sarah   0   8   Sophia  0
7   6   1   6   Olivia  0   1   John    0



SELECT KeyId, p1.PersonId, p1.Name
INTO #Results
FROM #Keys k
 JOIN #People p1
    ON k.PersonId1 = p1.PersonId
    AND p1.IsDeleted = 0
 JOIN #People p2
    ON k.PersonId2 = p2.PersonId
    AND p2.IsDeleted = 0

INSERT INTO #Results
SELECT KeyId, p2.PersonId, p2.Name
FROM #Keys k
 JOIN #People p1
    ON k.PersonId1 = p1.PersonId
    AND p1.IsDeleted = 0
 JOIN #People p2
    ON k.PersonId2 = p2.PersonId
    AND p2.IsDeleted = 0

SELECT * from #Results
order by KeyId

DROP TABLE #People
DROP TABLE #Keys
DROP TABLE #Results

The final query returns this set:

KeyId   PersonId    Name
1   2   Madeline
1   1   John
4   2   Madeline
4   1   John
5   8   Sophia
5   4   Sarah
7   6   Olivia
7   1   John

But it has the problem that Keys 1 and 4 have the same people, just reversed in order. The set I'd like returned is:

KeyId   PersonId    Name
1   2   Madeline
1   1   John
5   4   Sarah
5   8   Sophia
7   1   John
7   6   Olivia
Noel
  • 600
  • 16
  • 37
  • Is there a reason you cannot make a constraint based on the uniqueness of a key-value pair? You'd have to make sure your application code can handle the failure to insert the key value pair though http://stackoverflow.com/questions/15800250/add-unique-constraint-to-combination-of-two-columns – Marshall Tigerus Jun 09 '16 at 16:15

1 Answers1

0

First I would make

PersonId1 int,
PersonId2 int

The PK on #Keys and drop KeyId

A quick way to get the unique

select PersonId1, PersonId2 
from keys 
where PersonId1 < PersonId2 
union 
select PersonId2, PersonId1 
from keys 
where PersonId2 < PersonId1 

Clearly you would need to add in the join on deleted

You could also put a constraint on #Keys that PersonId1 < PersonId2

I think this will work also

select PersonId1, PersonId2 from keys 
except 
select PersonId2, PersonId1 from keys 
paparazzo
  • 44,497
  • 23
  • 105
  • 176
  • I need to keep the KeyId to show the links between the people downstream in the processing, but using your method, grouping on the personIds, and getting the MAX(KeyId) gives me the unique set I need. – Noel Jun 09 '16 at 19:22