It took me 3 CTEs and a couple of cups of coffee but here you have it...
My primary concern is that I read this from the comments
It's a repeatable task. There will be several groups and we
will have to do it for each group. The total record count across all
groups could be millions.
This cannot be a repeatable task as the resources consumption will be high, I recommend you to use it to normalize your groups once and add the logic in your application or stored procedures to store the new data with the desired groups
DECLARE @table TABLE (id int not null identity, [Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3))
insert @table values
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');
with
/* Find all matches
id Member MatchWith
1 M1 M3
2 M2 M5
3 M3 M1
3 M3 M4 ...
*/
matches as (
SELECT t.id, t.[Group], t.Member, a.member as MatchWith
from
@table t
outer apply (
select distinct member
from @table
where member <> t.member and [group] = t.[group] and (Address = t.Address OR Phone = t.Phone OR Email = t.Email)
) a
)
/* Stuffing the matches per member
id Member AllMatches
1 M1 M1,M3
2 M2 M2,M5
3 M3 M1,M3,M4 .....
*/
, matchsummary as (
SELECT DISTINCT id, [Group], Member, STUFF((
SELECT ',' + Member FROM (
SELECT m.Member
UNION ALL
SELECT DISTINCT MatchWith
FROM matches
WHERE Member = m.Member) U
ORDER BY Member
FOR XML PATH('')
), 1, 1, '') as AllMatches
FROM matches m
)
/* Recursive CTE to find "cousins" records (M1, M3 matches on Address and Email; M3 in turn matches with M4 on Phone)
id Member AllMatches gr
1 M1 M1,M3 1
2 M2 M2,M5 2
3 M3 M1,M3,M4 1
4 M4 M3,M4,M8 1
*/
, tree as (
select *, ROW_NUMBER() over (order by id) as gr
from matchsummary where AllMatches LIKE member+'%'
/* The groups are created using the Members who are the first one in their matches
id Member AllMatches gr
1 M1 M1,M3 1
2 M2 M2,M5 2
6 M6 M6,M7 3
10 M10 M10 4
*/
union all
select s.*, t.gr
from matchsummary s
join tree t on s.Member <> t.Member and s.[Group] = t.[Group] and s.AllMatches NOT LIKE s.member+'%' and t.AllMatches like '%' + s.Member
)
select * from tree
order by id
option(maxrecursion 0)
Output:
ID Group Member NewGroup
1 G1 M1 1
2 G1 M2 2
3 G1 M3 1
4 G1 M4 1
5 G1 M5 2
6 G1 M6 3
7 G1 M7 3
8 G1 M8 1
9 G1 M9 3
10 G1 M10 4
Second Option
Given the size of your table I recommend you to use this, I am not a big fan of loops but here I think they worth it, that way you don't need to process all your data at once,
First, you need to add a new column on your table to store the new group, my first thought was that would be better to change your application's logic to calculate that group when a new record is inserted, but thinking it better, an insert can cause several groups becoming one, and you probably need fast response in your application. So, you can set a job to regroup your data as often as you need it, if you have an UpdatedDate field in your table you also can refine this solution using a Log table and reprocess only groups that were modified after their last execution.
IF OBJECT_ID('tempdb..#table') IS NOT NULL
DROP TABLE #table;
CREATE TABLE #table ([Group] varchar(3), Member varchar(3), Address varchar(3), Phone varchar(3), Email varchar(3))
INSERT #table ([Group], Member, Address, Phone, Email)
VALUES
('G1', 'M1', 'A1', 'P1', 'E1'),
('G1', 'M2', 'A2', 'P2', 'E2'),
('G1', 'M3', 'A1', 'P3', 'E1'),
('G1', 'M4', 'A4', 'P3', 'E4'),
('G1', 'M5', 'A5', 'P5', 'E2'),
('G1', 'M6', 'A6', 'P6', 'E6'),
('G1', 'M7', 'A7', 'P6', 'E7'),
('G1', 'M8', 'A8', 'P8', 'E4'),
('G1', 'M9', 'A9', 'P9', 'E7'),
('G1', 'M10', 'A10', 'P10', 'E10');
ALTER TABLE #table ADD newGroup INT
/******************************************************************
START HERE
******************************************************************/
IF OBJECT_ID('tempdb..#Groups') IS NOT NULL
DROP TABLE #Groups;
SELECT DISTINCT [Group] INTO #Groups FROM #table
DECLARE @Group VARCHAR(3)
WHILE EXISTS (SELECT 1 FROM #Groups)
BEGIN
SELECT TOP 1 @Group = [Group] FROM #Groups
UPDATE #table SET newGroup = NULL
WHERE [Group] = @Group
DECLARE @newGroup INT = 1
DECLARE @member varchar(3)
WHILE EXISTS (SELECT 1 FROM #table WHERE [Group] = @Group AND newGroup IS NULL)
BEGIN
SELECT TOP 1 @member = member FROM #table WHERE [group] = @group AND newGroup IS NULL
UPDATE #table SET newGroup = @newGroup
WHERE Member = @member
WHILE @@ROWCOUNT > 0
BEGIN
UPDATE T
SET newGroup = @newGroup
FROM #table T
WHERE [Group] = @group AND newGroup IS NULL
AND EXISTS (
SELECT 1 FROM #table
WHERE newGroup = @newGroup
AND (Address = t.Address OR Phone = t.Phone OR Email = t.Email)
)
END
SET @newGroup += 1
END
DELETE #Groups WHERE [Group] = @Group
END
SELECT * FROM #table