
I'm working on a requirement where I need to match a set of records in a group (G1) on some fields, and re-group the matching records into unique new groups (NG1, NG2...). The requirement is something like below:

Sample Data

DECLARE @table TABLE ([Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3)) 

insert @table values
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10')

In the sample data above, M1, M3, M4, and M8 should fall into the same group: M1 and M3 match on Address and Email; M3 in turn matches M4 on Phone; and M4 matches M8 on Email. That is, they are related through one or more of the attributes.

Likewise, M6, M7, and M9 should be in another unique group; and M2 and M5 in the same group (they match on Email).

M10 will be in a group by itself, as it doesn't match any other record.

Besides G1, there will be other main groups as well.
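
Putting that together, the expected output for the sample data would be as follows (which records share a group is fixed; the NG labels themselves are just one valid numbering):

Member  NewGroup
M1      NG1
M3      NG1
M4      NG1
M8      NG1
M2      NG2
M5      NG2
M6      NG3
M7      NG3
M9      NG3
M10     NG4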

Can anyone please help? NOTE: I'm using MS SQL Server.

  • I removed the incompatible database tags. Please tag only with the database you are really using. – Gordon Linoff Mar 26 '19 at 11:21
  • This is potentially horrible from a performance perspective. So how many records are you handling? Given that number, what is the expected (or acceptable) elapsed time? Is this a one-off exercise or a repeatable task? Are you looking for a pure SQL solution or would a procedural approach be acceptable? What should happen if say M23 matches with M3 on EMAIL but matches with M10 on PHONE (i.e. different attributes match to different main groups)? – APC Mar 26 '19 at 11:30
  • Most people prefer sample data as text, not images –  Mar 26 '19 at 11:44
  • @APC, Thank you. It's a repeatable task. There will be several groups and we will have to do it for each group. The total record count across all groups could be millions. Yes, a procedural approach is acceptable. The scenario you mentioned is possible. If M2 matches with M9 on Address and with M10 on Phone, M2, M9, M10 should be in same group – Dave Mar 26 '19 at 19:20
  • *"M2, M9, M10 should be in same group"* - so it's possible for a record to be in more than one main group? – APC Mar 27 '19 at 06:45
  • I added the code to generate the sample data; it is better than images. – Daniel Brughera Mar 27 '19 at 07:52
  • This is a **very** complex process and is better done outside SQL. If you still need to do it in SQL, check this question: https://stackoverflow.com/questions/50619401/how-to-group-hierarchical-relationships-together-in-sql-server – EzLo Mar 27 '19 at 08:55
  • @APC, No, what I meant was that M2, M9, M10 should be in the same subgroup derived from the matching. All the matching will be within the same Main group (G1) only. We don't have to look for any matching between records in different Main groups. – Dave Mar 27 '19 at 19:14

2 Answers


In Microsoft SQL Server I would do the following, assuming the data is in a table called "DataTable":

WITH
    -- For each member, find the lowest-sorting earlier member (string
    -- comparison on Member) in the same group that shares an Address,
    -- Phone, or Email; NULL when there is none.
    [Matches] AS
    (
        SELECT
            D1.[Group],
            D1.[Member],
            D2.[Member] AS [PreviousMatchingMember]
        FROM
            [DataTable] AS D1
            OUTER APPLY (SELECT TOP (1) [Member]
                         FROM [DataTable]
                         WHERE
                             [Group] = D1.[Group] AND
                             [Member] < D1.[Member] AND
                             ([Address] = D1.[Address] OR
                              [Phone] = D1.[Phone] OR
                              [Email] = D1.[Email])
                         ORDER BY
                             [Member]) AS D2
    ),
    -- Members with no earlier match seed the new groups (NG1, NG2, ...);
    -- the recursive part gives every chained member its predecessor's group.
    [Groups] AS
    (
        SELECT
            [Group],
            [Member],
            [PreviousMatchingMember],
            'NG' + LTRIM(ROW_NUMBER() OVER (ORDER BY [Group], [Member])) AS [NewGroup]
        FROM
            [Matches]
        WHERE
            [PreviousMatchingMember] IS NULL
    UNION ALL
        SELECT
            M.[Group],
            M.[Member],
            M.[PreviousMatchingMember],
            G.[NewGroup]
        FROM
            [Groups] AS G
            INNER JOIN [Matches] AS M ON
                M.[Group] = G.[Group] AND
                M.[PreviousMatchingMember] = G.[Member]
    )
SELECT
    G.[NewGroup],
    G.[Member],
    D.[Address],
    D.[Phone],
    D.[Email]
FROM
    [Groups] AS G
    INNER JOIN [DataTable] AS D ON
        D.[Group] = G.[Group] AND
        D.[Member] = G.[Member]
ORDER BY
    G.[NewGroup],
    G.[Member];
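
To try this against the question's sample data, you can first materialize the rows into a physical table named "DataTable", which is the name the query above assumes:

-- Setup matching what the query above expects: a table named DataTable
-- holding the question's sample rows.
CREATE TABLE DataTable ([Group] varchar(3), Member varchar(3),
                        Address varchar(3), Phone varchar(3), Email varchar(3));

INSERT DataTable ([Group], Member, Address, Phone, Email) VALUES
('G1', 'M1',  'A1',  'P1',  'E1'),
('G1', 'M2',  'A2',  'P2',  'E2'),
('G1', 'M3',  'A1',  'P3',  'E1'),
('G1', 'M4',  'A4',  'P3',  'E4'),
('G1', 'M5',  'A5',  'P5',  'E2'),
('G1', 'M6',  'A6',  'P6',  'E6'),
('G1', 'M7',  'A7',  'P6',  'E7'),
('G1', 'M8',  'A8',  'P8',  'E4'),
('G1', 'M9',  'A9',  'P9',  'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');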

Edit:

As APC pointed out in his comment on your question, you have a (huge) problem if a record matches multiple other records (on different address/phone/email fields). You might end up with records that would potentially belong to different groups. You might decide that those groups should be considered as one group, but my solution here is not fit to solve such a complex problem.
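
For example, with the rows Dave later posted in the comments, all five members are transitively connected (through P1 and A2), yet this query yields two new groups: '3007595' only matches members that sort after it, so it gets no [PreviousMatchingMember] and starts a group of its own.

insert into DataTable values
('G1', '2728761', NULL, 'P1', NULL),
('G1', '6295586', 'A2', 'P1', NULL),
('G1', '6295606', 'A2', 'P3', NULL),
('G1', '6295930', NULL, 'P1', NULL),
('G1', '3007595', 'A2', 'P3', NULL);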

Bart Hofland
  • Unfortunately, it can have such a scenario. M1 can match M2 on all three attributes (Address, Phone, Email); and it could have such a matching with any other record on one or many of the attributes. – Dave Mar 26 '19 at 21:45
  • Thank you. It works fine with a sample set of data. However, performance seems to be a concern: it hangs when running against the actual data (~12 million records across all groups). – Dave Mar 28 '19 at 02:21
  • This solution works fine with the sample data I gave. But it has an issue with the actual data, where "Member" values are numeric. For instance, with the data below, all members are supposed to be in the same "New Group", but they end up in two New Groups. **`insert into DataTable values ('G1', '2728761', NULL, 'P1', NULL), ('G1', '6295586', 'A2', 'P1', NULL), ('G1', '6295606', 'A2', 'P3', NULL), ('G1', '6295930', NULL, 'P1', NULL), ('G1', '3007595', 'A2', 'P3', NULL);`** – Dave Mar 29 '19 at 01:08
  • @Dave, see my second approach. You can use a temp table if you don't want to add a new column to your core table, but as it isn't a tree hierarchy, the recursive CTE can fall into infinite loops, since recursive CTEs don't allow a left join or multiple references to the recursive member (like adding an exists to validate that the new records are not already there). – Daniel Brughera Mar 29 '19 at 15:06
  • @Daniel Brughera, I tried your first solution. It's not getting the expected result for the data below: **`Insert into DataTable values ('G1', '2728761', NULL, 'P1', NULL), ('G1', '6295586', 'A2', 'P1', NULL), ('G1', '6295606', 'A2', 'P3', NULL), ('G1', '6295930', NULL, 'P1', NULL), ('G1', '3007595', 'A2', 'P3', NULL);`** The second solution seems to be a performance concern, as it takes more than 10 minutes even for this sample data and then gets disconnected without retrieving the results (I'm keeping the data in a physical table). Looks like it's getting into some infinite recursion. – Dave Mar 30 '19 at 03:52
  • This solution (second approach) works fine. However, the total record count comes to around 25 million and it takes too long to process the full set. Is there any way we can process the groups in parallel, by dividing them into different batches (like 10000 groups in one batch), to speed up the processing? – Dave Apr 07 '19 at 00:18

It took me 3 CTEs and a couple of cups of coffee, but here you have it... My primary concern is this, which I read in the comments:

It's a repeatable task. There will be several groups and we will have to do it for each group. The total record count across all groups could be millions.

This shouldn't be a repeatable task, as the resource consumption will be high. I recommend using it to normalize your groups once, and then adding logic to your application or stored procedures to store new data with the desired groups.

DECLARE @table TABLE (id int not null identity, [Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3)) 

insert @table values
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');

with 
/* Find all matches
id  Member  MatchWith
1   M1  M3
2   M2  M5
3   M3  M1
3   M3  M4 ...
*/
matches as (
    SELECT t.id, t.[Group], t.Member, a.member as MatchWith
    from 
    @table t
    outer apply (
        select distinct member 
        from @table 
        where member <> t.member and [group] = t.[group] and (Address = t.Address OR Phone = t.Phone OR Email = t.Email)
    ) a
)
/* Stuffing the matches per member
id  Member  AllMatches
1   M1  M1,M3
2   M2  M2,M5
3   M3  M1,M3,M4 .....
*/
, matchsummary as (
    SELECT DISTINCT id, [Group], Member, STUFF((
                SELECT ',' + Member FROM (
                SELECT m.Member
                UNION ALL
                SELECT DISTINCT MatchWith
                FROM matches
                WHERE Member = m.Member) U
                ORDER BY Member
                FOR XML PATH('')
                ), 1, 1, '') as AllMatches
    FROM matches m
)
/* Recursive CTE to find "cousins" records (M1, M3 matches on Address and Email; M3 in turn matches with M4 on Phone)
id  Member  AllMatches  gr
1   M1  M1,M3   1
2   M2  M2,M5   2
3   M3  M1,M3,M4    1
4   M4  M3,M4,M8    1
*/
, tree as (
    select *, ROW_NUMBER() over (order by id) as gr
    from matchsummary where AllMatches LIKE member+'%'
    /* The groups are created using the Members who are the first one in their matches 
    id  Member  AllMatches  gr
    1   M1  M1,M3   1
    2   M2  M2,M5   2
    6   M6  M6,M7   3
    10  M10 M10 4
    */
    union all
    select s.*, t.gr 
    from matchsummary s
    join tree t on s.Member <> t.Member and s.[Group] = t.[Group] and s.AllMatches NOT LIKE s.member+'%' and t.AllMatches like '%' + s.Member
)
select * from tree
order by id
option(maxrecursion 0)

Output:

ID  Group  Member  NewGroup
1   G1     M1      1
2   G1     M2      2
3   G1     M3      1
4   G1     M4      1
5   G1     M5      2
6   G1     M6      3
7   G1     M7      3
8   G1     M8      1
9   G1     M9      3
10  G1     M10     4

Second Option

Given the size of your table, I recommend this one. I am not a big fan of loops, but here I think they are worth it: this way you don't need to process all your data at once.

First, you need to add a new column to your table to store the new group. My first thought was that it would be better to change your application's logic to calculate the group when a new record is inserted; but thinking about it more, an insert can cause several groups to become one, and you probably need a fast response in your application. So you can set up a job to regroup your data as often as you need. If you have an UpdatedDate field in your table, you can also refine this solution with a log table and reprocess only the groups that were modified after their last execution, as sketched below.
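
A minimal sketch of that log-table idea, assuming a core table named CoreTable with an UpdatedDate column (both names are illustrative, not part of the original schema):

-- Hypothetical log table: remembers when each group was last regrouped.
CREATE TABLE GroupRegroupLog (
    [Group] varchar(3) NOT NULL PRIMARY KEY,
    LastProcessed datetime2 NOT NULL
);

-- Only groups touched since their last run need reprocessing
-- (assumes the core table has an UpdatedDate column).
SELECT DISTINCT t.[Group]
FROM CoreTable t
LEFT JOIN GroupRegroupLog l ON l.[Group] = t.[Group]
WHERE l.[Group] IS NULL OR t.UpdatedDate > l.LastProcessed;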

 IF OBJECT_ID('tempdb..#table') IS NOT NULL
    DROP TABLE #table;
CREATE TABLE #table ([Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3)) 

INSERT #table ([Group], Member, Address, Phone, Email)
VALUES
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');

ALTER TABLE #table ADD newGroup INT

/******************************************************************
START HERE
******************************************************************/

IF OBJECT_ID('tempdb..#Groups') IS NOT NULL
    DROP TABLE #Groups;

SELECT DISTINCT [Group] INTO #Groups FROM #table

DECLARE @Group VARCHAR(3)

WHILE EXISTS (SELECT 1 FROM #Groups)
BEGIN

    SELECT TOP 1 @Group = [Group] FROM #Groups

    UPDATE #table SET newGroup = NULL 
    WHERE [Group] = @Group

    DECLARE @newGroup INT = 1
    DECLARE @member varchar(3)

    WHILE EXISTS (SELECT 1 FROM #table WHERE [Group] = @Group AND newGroup IS NULL)
    BEGIN

        SELECT TOP 1 @member = member FROM #table WHERE [group] = @group AND newGroup IS NULL
    
        -- Seed the new group with this member (scoped to the current group,
        -- in case Member values repeat across main groups)
        UPDATE #table SET newGroup = @newGroup
        WHERE [Group] = @Group AND Member = @member

        WHILE @@ROWCOUNT > 0
        BEGIN
            -- Keep pulling in unassigned records of the current group that
            -- share an attribute with any record already in the new group
            UPDATE T
            SET newGroup = @newGroup
            FROM #table T
            WHERE [Group] = @Group AND newGroup IS NULL
            AND EXISTS (
                SELECT 1 FROM #table 
                WHERE [Group] = @Group AND newGroup = @newGroup
                AND (Address = T.Address OR Phone = T.Phone OR Email = T.Email)
            )
        END
        SET @newGroup += 1
    END
    DELETE #Groups WHERE [Group] = @Group
END

SELECT * FROM #table
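
As the comment thread below suggests, the groups are independent, so the outer loop can be run over batches of groups, possibly from separate sessions. A sketch of that idea (the bucket size of 10000 and the #GroupBatches table are assumptions):

-- Hypothetical batching: number the distinct groups into buckets of 10000,
-- then seed #Groups from a single bucket instead of the whole table.
IF OBJECT_ID('tempdb..#GroupBatches') IS NOT NULL
    DROP TABLE #GroupBatches;

SELECT [Group],
       (ROW_NUMBER() OVER (ORDER BY [Group]) - 1) / 10000 AS Batch
INTO #GroupBatches
FROM (SELECT DISTINCT [Group] FROM #table) g;

DECLARE @myBatch INT = 0;  -- each worker/job processes its own batch number

IF OBJECT_ID('tempdb..#Groups') IS NOT NULL
    DROP TABLE #Groups;
SELECT [Group] INTO #Groups FROM #GroupBatches WHERE Batch = @myBatch;
-- ...then run the WHILE loop above unchanged.
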
Daniel Brughera
  • Thank you for your effort. I'm not sure if you got confused by the info in the comments. All the matching has to be within the same main group (G1) only. Which means, when there are multiple main groups G1, G2, G3..., we don't have to look for any matching records between different main groups. I.e., if M1 in main group G1 matches M25 in G8, we don't have to look for it; M25 only has to be checked for matches against records in G8. Does it make sense? – Dave Mar 27 '19 at 19:21
  • Absolutely, it can be made scalable just by adding the group condition to the CTEs. – Daniel Brughera Mar 27 '19 at 21:28
  • Thanks. With the recursive CTEs, performance seems to be a concern, as the actual data contains around 13 million records (total count across all groups). – Dave Mar 28 '19 at 02:24
  • As mentioned, the requirement by itself demands a lot of resources; to gain performance you need to redesign the solution. You can find my recommendation in the second paragraph. – Daniel Brughera Mar 28 '19 at 05:58
  • @Dave I added the group conditions – Daniel Brughera Mar 28 '19 at 07:38
  • @Dave I included a second option that can help with the performance – Daniel Brughera Mar 28 '19 at 08:51
  • @Daniel Brughera, I tried your first solution. It's not getting the expected result for the data below: **insert into DataTable values ('G1', '2728761', NULL, 'P1', NULL), ('G1', '6295586', 'A2', 'P1', NULL), ('G1', '6295606', 'A2', 'P3', NULL), ('G1', '6295930', NULL, 'P1', NULL), ('G1', '3007595', 'A2', 'P3', NULL);** The second solution seems to be a performance concern, as it takes more than 10 minutes even for this sample data and then gets disconnected without retrieving the results (I'm keeping the data in a physical table). Looks like it's getting into some infinite recursion. – Dave Mar 30 '19 at 02:20
  • Aren't you missing a condition or something? I tested it and it works fine. – Daniel Brughera Mar 30 '19 at 06:52
  • @Daniel Brughera, which option did you test with the new data given? First or second? – Dave Mar 31 '19 at 03:13
  • @Daniel Brughera, ah, sorry, it was my mistake: I declared the variable without enough length. Thank you!! However, will the use of @@ROWCOUNT cause any issue in a multithreaded environment? There will be other jobs running in the same schema while this SQL executes. – Dave Mar 31 '19 at 11:13
  • @@ROWCOUNT works for the current connection and scope only; even if you have triggers on the updated table, it only counts the rows directly affected by your statement. – Daniel Brughera Mar 31 '19 at 12:10
  • @Daniel Brughera, it works fine as per the requirements. However, in the actual data there are around 1.5 million distinct groups. Can we do something about running different sets of groups in parallel? Like groups 1-10000, 10000-20000, etc. I could see that when running against the full actual data it processes fewer than 200 groups in a minute, whereas when running for only 10000 groups it finishes within 1 minute. – Dave Mar 31 '19 at 22:32
  • @Daniel Brughera, I could fix the performance issue. I put an outer loop to process the full record count in batches of 1000 records each. Thanks. – Dave Apr 01 '19 at 02:04
  • This solution works fine. However, the total record count comes to around 25 million and it takes too long to process the full set. Is there any way we can process the groups in parallel, by dividing them into different batches (like 10000 groups in one batch), to speed up the processing? – Dave Apr 07 '19 at 00:18
  • Do you need to process all groups each time you run it? The log table is an option, and I assume that you can run different groups simultaneously; the same log table can help you prevent two processes working on the same groups. – Daniel Brughera Apr 07 '19 at 11:52
  • Yes, I will have to process all groups each time. But as the groups are independent, we could process different groups simultaneously in some way. – Dave Apr 07 '19 at 21:49