-1

Hi I'm currently trying to create a query to group customers together.

I'm currently wanting to group by "Phone Number", "Email" and maybe some other fields in the future.

My problem is that I may want to group rows which aren't directly linked but have mutual rows.

For example, in this diagram we can see that Customer 2 and Customer 5 are in no way related at all, however they both share connections with Customer 1 which would then allow them to group together.

Venn Diagram Describing my ideal Data Set Venn Diagram Describing my ideal Data Set

Here is another View: CustomerId | FullName | PhoneNumber | Email --|--|--|--------------------------------------------------------------------------- 1 | Bill Smith | 01612345678 | BillS@SqlTeam.com 2 | Kelly Smith | 01612345678 | KellyS@SqlTeam.com 3 | Kevin Roach | 07111111111 | KevinR@Example.com 4 | Chris Ronald | 07222222222 | ChrisR@Blog.co.uk 5 | Bill Smith | 07987654321 | BillS@SqlTeam.com

We can see that Customer 2 is connected to Customer 1 through a mutual Phone Number, and Customer 5 is connected to 1 through a mutual Email Address. However Customer 2 and Customer 5 share no information which would've originally grouped them.

My reasoning for needing this functionality is that one person / household could have multiple customer accounts all with differing information therefore I am trying to group them up as best as I can with as much mutual information as possible.

I have already tried to use DENSE_RANK() and GROUP BY, however these seem the additionally partition the groups when adding more than 1 column, I want a group to be created solely on the fact that a data field matches another.

bansi
  • 55,591
  • 6
  • 41
  • 52

1 Answers1

0

I've adopted the excellent answer from How can I combine Group Identifiers into Single Group? by GMB to your issue:

SELECT  *
INTO    #data
FROM    (
    VALUES  (1, N'Bill Smith', N'01612345678', N'BillS@SqlTeam.com', NULL)
    ,   (2, N'Kelly Smith', N'01612345678', N'KellyS@SqlTeam.com', NULL)
    ,   (3, N'Kevin Roach', N'07111111111', N'KevinR@Example.com', '12345')
    ,   (4, N'Chris Ronald', N'07222222222', N'ChrisR@Blog.co.uk', NULL)
    ,   (5, N'Bill Smith', N'07987654321', N'BillS@SqlTeam.com', NULL)
    ,   (6, N'Gill Smith', N'07987654322', NULL, 12345)
    ,   (7, N'Gill McGill', N'07987654322', NULL, NULL)
    ,   (8, N'Gillian Smith', NULL, NULL, 12345)
    ,   (9, N'Smith W', NULL, 'KellyS@SqlTeam.comx', 12345)
) t (CustomerId,FullName,PhoneNumber,Email, Fax)

;WITH edges AS (
    SELECT  d.customerid, d2.customerid AS custIdTo
    FROM    #data d
    INNER JOIN #data d2
        ON  (
            d2.PhoneNumber = d.PhoneNumber
        OR  d2.email = d.email
        OR  d2.fax = d.fax
        )
        --AND   d2.customerid <> d.customerid -- remove unmatched customers by uncommenting this row
    ) 
, rec AS (
    SELECT  customerid, custidto, cast(concat(customerid, ',', custidto) AS nvarchar(max)) AS path
    FROM    edges
    UNION ALL
    SELECT  c.customerid, e.custidto, concat(path, ',', e.custidto)
    FROM    rec c
    INNER JOIN edges e
        ON  e.customerid = c.custidto
    WHERE   ',' + path + ',' NOT LIKE '%,' + cast(e.custidto AS nvarchar(max)) + ',%'
    
    )
SELECT  customerid, CAST(MIN(x.value) AS INT) AS baseCustomer
FROM    rec 
CROSS apply string_split(path, ',') x
GROUP BY customerid
option(maxrecursion 0);

As a sample, i've added a fax field, and the customers are matched by either phone, email or fax.

It's a bit complicated to explain how it works.

First i create some sample data and put it into #data table.

Then we do two CTEs:

  • edges contains all the matches between customers.
  • a recursing cte called rec works by first selecting all the matches. Then for each match, it keeps the original customerid, and then checks which other matches exist between the customerIdTo and its own matches.
  • the cte will build a thing called path, which is the "road" it took from each match, by doing a concat between custIdFrom, "," and custIdTo.

For example if we have following edge data:

  from to
  1  2
  2  3
  3  1

it will take 1 => 2, and then look for 2's match in the edge data, and found 2 => 3. Then it will look for 3's matches and find 3 => 1.

So for customer 1, it will create three rows, 1 => 2, 1 => 3, 3 => 1, but then it will check so it doesn't loop back to same row already done, so 3 => 1 will be discarded because 1 it was already in 1 => 2 path.

This check is performed here: WHERE ',' + path + ',' NOT LIKE '%,' + cast(e.custidto AS nvarchar(max))

If we don't do this check, the loop will continue forever which isn't great.

Finally, we have a lot of rows with customer and the path to each node. To get the matching node, we do a string_split and get the minimum "custIdTo" value for each customerId

EDIT: The performance of these things ain't great because of the sheer amount of potential loops and joins. An alternative could be something like this: https://stackoverflow.com/a/75560065/13061224 which is a bit more trivial but means loops

siggemannen
  • 3,884
  • 2
  • 6
  • 24