0

I have a table in SQL Server with two columns, containing integers, for example:

 1  2 
 2  7 
 5  7 
 7 10 
10 11 
12 13 
13 14

I need to group them based on common integers. In the example 1, 2 and 2, 7 get into the first group because they share common integer 2. 5, 7 get into this first group as well as they share 7. So does 7, 10 follow into this first group. And 10, 11 follow into first group as well.

But 12 and 13 have no common members with the first group, so they create their own group, where 13, 14 are added.

So in the output we have

1 1
1 2
1 5
1 7
1 10
1 11
2 12
2 13
2 14

I hope the principle is clear. The naming of the groups is irrelevant.

Table is filtered so that left integer is less than the right one and every row is unique.

To achieve this goal I've written a code in T-SQL using recursive query ([dbo].[CDI] is the source table) :

CREATE TABLE [dbo].[CDI_proxy]
(
    [ID1] [bigint] NOT NULL,
    [ID2] [bigint] NOT NULL,
    primary key ([ID1], [ID2])
) ON [PRIMARY]

CREATE TABLE [dbo].[CDL]
(
    [ID1] [bigint] NOT NULL,
    [ID2] [bigint] NOT NULL,
    [cnt] [int] NOT NULL default(0),
    primary key ([ID1], [ID2])
) ON [PRIMARY]

create nonclustered index IX_1
on [dbo].[CDL] ([cnt]) include ([ID1], [ID2])

CREATE TABLE [dbo].[CDR]
(
    [ID1] [bigint] NOT NULL,
    [ID2] [bigint] NULL
) ON [PRIMARY]

insert into [dbo].[CDI_proxy]
    select
        d1.ID1,
        d2.ID2
    from
        [dbo].[CDI] d1

    union

    select
        d2.ID2,
        d2.ID1
    from
        [dbo].[CDI] d2;


WITH cte([ID1], [ID2], LVL) AS
(
    --Anchor Member
    (SELECT
         d1.ID1,
         d1.ID2,
         0 as LVL
     FROM 
         [dbo].[CDI_proxy] d1
    )

    UNION ALL

    --Recursive Member
    SELECT 
        r.[ID1],
        cte.[ID2],
        LVL + 1 AS LVL
    FROM
        [dbo].[CDI_proxy] r
    INNER JOIN
        cte ON r.ID2 = cte.ID1
)
INSERT INTO [dbo].[CDL]([ID1], [ID2])
    SELECT DISTINCT
        cte1.[ID1],
        cte1.[ID2]
    FROM 
        cte cte1
    OPTION (MAXRECURSION 0)

UPDATE [dbo].[CDL]
SET [cnt] = ag.cnt
FROM 
    (SELECT cdl.ID1, COUNT(cdl.ID2) AS cnt
     FROM [dbo].CDL cdl
     GROUP BY cdl.ID1) ag
WHERE ag.ID1 = [CDL].ID1

INSERT INTO [dbo].[CDR] ([ID1], [ID2])
    SELECT
        [ID1], [ID2]
    FROM
        (SELECT 
             cdl.*,
             ROW_NUMBER() OVER (PARTITION BY cdl.ID2 
                                ORDER BY cdl.cnt DESC, cdl.ID1 DESC) rnk
         FROM 
             [dbo].[CDL] cdl) cdl
    WHERE 
        rnk = 1

I run this script on approx. 5 million rows and it takes 3 hours to run without any ending (I turn it out). If I change part of the script like this

 --Recursive Member
    SELECT r.[ID1],
            cte.[ID2],
            LVL + 1 AS LVL
    FROM [dbo].[CDI_proxy] r
        inner join
        cte ON r.ID2=cte.ID1
        where LVL > 5

)

INSERT INTO ...

then it runs for 3 minutes and when afterwards I look at the result of the query

select id1, count(*) cnt
from dbo.CDR
group by id1
having count(*) > 5
order by cnt desc

then the highest group has only 8 members.

I suspect that my query goes into infinite recursion when LVL is less than 5. Is it possible and if yes then how?

Or did I make a mistake in my code?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
  • I swear I saw this exact question a month or so ago. One of your classmates, maybe? I'll see if I can find it... – Tab Alleman Dec 13 '18 at 21:12
  • Have a look at a very similar question [How to find all connected subgraphs of an undirected graph](https://stackoverflow.com/q/35254260/4116017) – Vladimir Baranov Dec 14 '18 at 03:47
  • Possible duplicate of [How to find all connected subgraphs of an undirected graph](https://stackoverflow.com/questions/35254260/how-to-find-all-connected-subgraphs-of-an-undirected-graph) – Vladimir Baranov Dec 14 '18 at 03:49

1 Answers1

1

This is an example of walking through an undirected graph -- the edges go in both directions. In SQL Server, this is a bit messy.

But the idea is to start with each edge. Then add another edge on either end, comparing the nodes to be sure you are not generating any cycles. This can be done recursively.

Then, you can take the minimum node value that appears in the paths and use that to define the graph.

Here is some code:

with t as (
      select *
      from (values (1, 2), (2, 7), (5, 7), (7, 10), (10, 11), (12, 13), (13, 14)) v(x, y)
     ),
     tt as (
      select v.x, v.y
      from t cross apply
           (values (x, y), (y, x)) v(x, y)
     ),
     cte as (
      select (case when tt.x < tt.y then tt.x else tt.y end) as lowest, v.val, tt.x, tt.y, convert(varchar(max), concat(',', tt.x, ',', tt.y, ',')) as vals
      from tt cross apply
           (values (x), (y)) v(val)
      union all
      select (case when tt.y < cte.lowest then tt.y else cte.lowest end) as lowest, cte.val, cte.x, tt.y, concat(cte.vals, tt.y, ',') as vals
      from cte join
           tt
           on cte.y = tt.x and cte.vals not like concat('%,', tt.y, ',%')
      union all
      select (case when tt.x < cte.lowest then tt.x else cte.lowest end) as lowest, cte.val, tt.x, cte.y, concat(cte.vals, tt.x, ',') as vals
      from cte join
           tt
           on cte.x = tt.y and cte.vals not like concat('%,', tt.x, ',%')
     )
select min(lowest) as grp, val
from cte
group by val;

And the db<>fiddle.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786