0

I have customer data coming from several different systems, with three potential unique identifiers. We need to create a single new ID that can be used to link ANY match between the three identifiers.

The data is in the following structure:

Primary Key CustomerID1 CustomerID2 CustomerID3
1 Alpha Dog Jeans
2 Alpha Cat Shirt
3 Beta Dog Dress
4 Gamma Bear Jeans
5 Alpha Dog Jeans
6 Epsilon Bird Boots

In the end, keys 1-5 should all be linked based on the bolded matches while 6 should have its own ID. So the desired output would be a new ID generated for just 2 customers from this list and then a second table with all the IDs & IDTypes linked to that new ID (but open to changing the end result structure so long as we generate 1 ID for each customer):

New ID
Key1
Key2
New ID ID IDType
Key1 Alpha CustomerID1
Key1 Beta CustomerID1
Key1 Gamma CustomerID1
Key1 Dog CustomerID2
Key1 Cat CustomerID2
Key1 Bear CustomerID2
Key1 Jeans CustomerID3
Key1 Shirt CustomerID3
Key1 Dress CustomerID3
Key2 Epsilon CustomerID1
Key2 Bird CustomerID2
Key2 Boots CustomerID3

I've tried creating a list of all possible matches like the below but I'm not sure how to consolidate that down into a single ID for each customer.

DROP TABLE IF EXISTS #Test
CREATE TABLE #Test (PrimaryKey int, CustomerID1 varchar(15), CustomerID2 varchar(15), CustomerID3 varchar(15))
INSERT INTO #Test VALUES
     (1,'Alpha','Dog','Jeans')
    ,(2,'Alpha','Cat','Shirt')
    ,(3,'Beta','Dog','Dress')
    ,(4,'Gamma','Bear','Jeans')
    ,(5,'Alpha','Dog','Jeans')
    ,(6,'Epsilon','Bird','Boots')

SELECT
     t1.PrimaryKey
    ,t2.Primarykey
FROM #Test t1
JOIN #Test t2 on t2.PrimaryKey != t1.PrimaryKey and t1.CustomerID1 = t2.CustomerID1
UNION
SELECT
     t1.PrimaryKey
    ,t2.Primarykey
FROM #Test t1
JOIN #Test t2 on t2.PrimaryKey != t1.PrimaryKey and t1.CustomerID2 = t2.CustomerID2
UNION
SELECT
     t1.PrimaryKey
    ,t2.Primarykey
FROM #Test t1
JOIN #Test t2 on t2.PrimaryKey != t1.PrimaryKey and t1.CustomerID3 = t2.CustomerID3

I feel like the solution is obvious but I'm stuck so any help is appreciated! Thank you!

lbanker
  • 65
  • 3
  • What if you have {"Gamma", "Bird" "Boots"}? Would that match both record 4 or record 6 or both? Would that then transitively link the two previously separate sets? What if that record were first in the list? Unless you are dealing with well defined identifiers (SSN, employee ID, state + drivers license number) that you know will never be shared among multiple different entities, I see problems ahead. – T N Feb 24 '23 at 18:33
  • 1
    Good question! A match on any identifier would be a match to that record. The IDs are actually internal account numbers that would be unique & not shared (SystemName + SystemID). – lbanker Feb 24 '23 at 18:38

1 Answers1

1

It's a bit tricky to do in one select (at least for me). I usually do something like this:

SELECT  *, CAST(NULL AS INT) AS ID_To
INTO #t
FROM
(
    VALUES  (1, N'Alpha', N'Dog', N'Jeans')
    ,   (2, N'Alpha', N'Cat', N'Shirt')
    ,   (3, N'Beta', N'Dog', N'Dress')
    ,   (4, N'Gamma', N'Bear', N'Jeans')
    ,   (5, N'Alpha', N'Dog', N'Jeans')
    ,   (5, N'Alpha', N'Bonanza', N'Boots')
    ,   (6, N'Epsilon', N'Bird', N'Boots')
    ,   (7, N'zz', N'dog', N'Bird')
    ,   (8, N'zzz', N'bye', N'hi')
    ,   (9, N'zzzz', N'bear', N'hi ho silver')
) t (ID,CustomerID1,CustomerID2,CustomerID3)

WHILE @@rowcount > 0
BEGIN
    UPDATE  t2
    SET ID_to = ISNULL(t.ID_To, t.ID)
    FROM    #t t
    CROSS APPLY (
        VALUES (t.Customerid1), (t.Customerid2), (t.Customerid3)
        ) v(externalId) 
    CROSS JOIN #t t2
    CROSS APPLY (
        VALUES (t2.Customerid1), (t2.Customerid2), (t2.Customerid3)
        ) v2(externalId) 
    WHERE   ISNULL(t.id_To, t.id) < ISNULL(t2.ID_to, t2.id)
    AND v.externalId = v2.externalId
    AND t.ID <> t2.ID
END

SELECT  *
FROM    #t

SELECT  ISNULL(ID_TO, ID) AS groups
FROM    #t
GROUP BY ISNULL(ID_TO, ID)

To explain the code:

  1. I create two columns in your table containing row ID (ID) and eventual match ID (ID_to). This is so we can consolidate how every customer is matched against other customers
  2. Then i do a loop. In the loop i try to match the keys between two different customers. What i want to do is to match every customer against another customerID1,2,3 (AND v.externalId = v2.externalId )
  3. To make sure the loop finish i only want to match if the new id is lower than current one ISNULL(t.id_To, t.id) < ISNULL(t2.ID_to, t2.id). If the matching customer is already matched to someone else i take his ID_to. This ensure that if there's a long chain of matching IDs, we will follow the chain all the way.
  4. The WHILE @@ROWCOUNT > 0 trick is quite useful to loop stuff until there's something to do. But it's important to actually have a breaking condition otherwise the loop runs forever.

You can put the select inside the while loop to track what's going on with the #t-table

When the loop is done, every customer should either has ID_to = NULL, meaning he is the "master", or ID_to > 0 which means this customer is matched against another master customer

siggemannen
  • 3,884
  • 2
  • 6
  • 24
  • You are a genius - exactly what I was looking for & truly appreciate your explanation! – lbanker Feb 24 '23 at 18:25
  • Glad it's useful! Just watch out for bad data in your externalids. For example if they are empty there might be some strange matches since '' = "" – siggemannen Feb 24 '23 at 18:28