Assign group_id to all linked records in many-to-many table

Question

There are 3 tables in my DB; Table 1 has 'Collateral', Table 2 has 'Loans', Table 3 is a multi-link table between 1 and 2; let's call it 'Loan_Collateral_link'.

A collateral can belong to 1 or more Loans, and a Loan can have multiple Collaterals.

What I want to achieve, is create a separate result set (or table) in which I group together all Loans and Collaterals which are in any way linked to eachother through different Loans and/or Collaterals.

Loans:

ID	name
Loan1	ABC
Loan2	DEF
Loan3	GHI
Loan4	JKL
Loan5	MNO

Collaterals:

ID	name
Coll1	Col1
Coll2	Col2
Coll3	Col3

Loan_Collateral_link:

Loan_ID	Collateral_ID
Loan1	Col1
Loan2	Col1
Loan2	Col3
Loan3	Col2
Loan4	Col2
Loan5	Col1
Loan5	Col3

So you see Loan1, Loan2 and Loan5 are sharing col1, so I want them to be grouped. Col3 should be in that same group, as it is linked through Loan2.

Loan4 and Loan3 are linked through Col2, so they should be a separate group.

The resultset would be something as per below:

Groups:

Group_ID	item_ID
Group1	Loan1
Group1	Loan2
Group1	Loan5
Group1	Col1
Group1	Col3
Group2	Loan3
Group2	Loan4
Group2	Col2

I'm trying to write a script, probably using a loop? to get this done. I was thinking to loop over each record in the Loan_Collateral_Link table, write every link that I find for the record into a temp_table. Then loop over this temp_table to find all linked records. However, I can't really seem to work it out conceptually, as the loop should somehow reference itself.

Something like;

--ForEach Loan_Collateral_Link Loan;
    --If not exists - Insert LOAN_ID into TempTable
    --If not exist - Insert COL_ID into TempTable
        --ForEach TempTable COL_ID;
            --Select * FROM Loan_Collateral_Link where Loan_ID or Col_ID matches
                --> If not exists - Insert LOAN_ID / COL_ID into TempTable

But this seems a bit like an infite loop perhaps?

Perhaps I should order all Loans, than go over Loan per Loan, add all the Col's, and in the loop per loan select all other loans with this loan. Than go over all the new Cols and select all matching Loans. However, than I have new Loans, so I need to go back and fetch their possible Cols. So again, how many times do I loop? Use a flag per collateral and keep looping until all flags are fulfilled?

Vladimir Baranov · Accepted Answer · 2022-08-27T12:30:53.720

A while ago I wrote an answer to the question How to find all connected subgraphs of an undirected graph. Your question here looks like the same problem.

Essentially, it treats the data as edges in a graph and traverses recursively all edges of the graph, stopping when the loop is detected. Then it puts all found loops in groups and gives each group a number.

For the detailed explanation of the query see that answer. I adapted that query to your schema here.

Also, this problem can be solved in linear time. Wikipedia has description of the algorithm. But, the T-SQL method shown here is far from optimal. If you still want to do it in T-SQL you can try the following. Instead of trying to find all groups at once, pick a single starting Identifier (add WHERE Ident='Loan1' to CTE_Idents and CTE_Pairs), then delete from the big table all identifiers that were appended to this group. Repeat until the big table is empty. At least you could see the progress. Proper indexes on Ident columns would not hurt as well.

Sample data

DECLARE @T TABLE (LoanID varchar(50), CollateralID varchar(50));
INSERT INTO @T (LoanID, CollateralID) VALUES
('Loan1', 'Col1'),
('Loan2', 'Col1'),
('Loan2', 'Col3'),
('Loan3', 'Col2'),
('Loan4', 'Col2'),
('Loan5', 'Col1'),
('Loan5', 'Col3'),

-- plus an extra group with some more interesting links
('Loan80', 'Col80'),
('Loan81', 'Col80'),
('Loan80', 'Col81'),
('Loan81', 'Col81'),
('Loan82', 'Col81')
;

Query

WITH
CTE_Idents
AS
(
    SELECT LoanID AS Ident
    FROM @T

    UNION

    SELECT CollateralID AS Ident
    FROM @T
)
,CTE_Pairs
AS
(
    SELECT LoanID AS Ident1, CollateralID AS Ident2
    FROM @T

    UNION

    SELECT CollateralID AS Ident1, LoanID AS Ident2
    FROM @T
)
,CTE_Recursive
AS
(
    SELECT
        CAST(CTE_Idents.Ident AS varchar(8000)) AS AnchorIdent 
        , Ident1
        , Ident2
        , CAST(',' + Ident1 + ',' + Ident2 + ',' AS varchar(8000)) AS IdentPath
        , 1 AS Lvl
    FROM 
        CTE_Pairs
        INNER JOIN CTE_Idents ON CTE_Idents.Ident = CTE_Pairs.Ident1

    UNION ALL

    SELECT 
        CTE_Recursive.AnchorIdent 
        , CTE_Pairs.Ident1
        , CTE_Pairs.Ident2
        , CAST(CTE_Recursive.IdentPath + CTE_Pairs.Ident2 + ',' AS varchar(8000)) AS IdentPath
        , CTE_Recursive.Lvl + 1 AS Lvl
    FROM
        CTE_Pairs
        INNER JOIN CTE_Recursive ON CTE_Recursive.Ident2 = CTE_Pairs.Ident1
    WHERE
        CTE_Recursive.IdentPath NOT LIKE CAST('%,' + CTE_Pairs.Ident2 + ',%' AS varchar(8000))
)
,CTE_RecursionResult
AS
(
    SELECT AnchorIdent, Ident1, Ident2
    FROM CTE_Recursive
)
,CTE_CleanResult
AS
(
    SELECT AnchorIdent, Ident1 AS Ident
    FROM CTE_RecursionResult

    UNION

    SELECT AnchorIdent, Ident2 AS Ident
    FROM CTE_RecursionResult
)
SELECT
    CTE_Idents.Ident
    ,CASE WHEN CA_Data.XML_Value IS NULL 
    THEN CTE_Idents.Ident ELSE CA_Data.XML_Value END AS GroupMembers
    ,DENSE_RANK() OVER(ORDER BY 
        CASE WHEN CA_Data.XML_Value IS NULL 
        THEN CTE_Idents.Ident ELSE CA_Data.XML_Value END
    ) AS GroupID
FROM
    CTE_Idents
    CROSS APPLY
    (
        SELECT CTE_CleanResult.Ident+','
        FROM CTE_CleanResult
        WHERE CTE_CleanResult.AnchorIdent = CTE_Idents.Ident
        ORDER BY CTE_CleanResult.Ident FOR XML PATH(''), TYPE
    ) AS CA_XML(XML_Value)
    CROSS APPLY
    (
        SELECT CA_XML.XML_Value.value('.', 'NVARCHAR(MAX)')
    ) AS CA_Data(XML_Value)
WHERE
    CTE_Idents.Ident IS NOT NULL
ORDER BY GroupID, Ident;

Result

+--------+-----------------------------------+---------+
| Ident  |           GroupMembers            | GroupID |
+--------+-----------------------------------+---------+
| Col1   | Col1,Col3,Loan1,Loan2,Loan5,      |       1 |
| Col3   | Col1,Col3,Loan1,Loan2,Loan5,      |       1 |
| Loan1  | Col1,Col3,Loan1,Loan2,Loan5,      |       1 |
| Loan2  | Col1,Col3,Loan1,Loan2,Loan5,      |       1 |
| Loan5  | Col1,Col3,Loan1,Loan2,Loan5,      |       1 |
| Col2   | Col2,Loan3,Loan4,                 |       2 |
| Loan3  | Col2,Loan3,Loan4,                 |       2 |
| Loan4  | Col2,Loan3,Loan4,                 |       2 |
| Col80  | Col80,Col81,Loan80,Loan81,Loan82, |       3 |
| Col81  | Col80,Col81,Loan80,Loan81,Loan82, |       3 |
| Loan80 | Col80,Col81,Loan80,Loan81,Loan82, |       3 |
| Loan81 | Col80,Col81,Loan80,Loan81,Loan82, |       3 |
| Loan82 | Col80,Col81,Loan80,Loan81,Loan82, |       3 |
+--------+-----------------------------------+---------+

Thank you for your answer! I'm trying to see if I can get this more performant (testing on a 300K record dataset now, has been running for over an hour). Got rid of the 'GroupMembers' column as this isn't required. Trying to read the SQL to see if I might be able to get rid of one of the CROSS APPLY's, but it is somewhat complicated. — Karel-Jan Misseghers, Aug 29 '22 at 13:48
300K rows will take forever in T-SQL. Read it all in memory and write your code in some procedural language like C++. To get an idea how slow it will be in T-SQL make a table with 1K rows, then 2K rows, then 4K, then 8K and so on and see how the run time increases non-linearly in T-SQL. — Vladimir Baranov, Aug 29 '22 at 13:53
You can also try to measure which parts of the query take the most time. For example, see how long it takes to get just `CTE_RecursionResult` - if that time is reasonable, then save it into a temp table, index it and proceed like that saving intermediary results into temp tables. — Vladimir Baranov, Aug 29 '22 at 14:02
Seems like using this query we ended up in an infinite loop using our data. — Karel-Jan Misseghers, Sep 01 '22 at 10:29
@Karel-JanMisseghers, it is unlikely, unless you have a so long chain of links that `IdentPath` grew beyond 8000 characters. Make it run step-by-step. Pick one starting value (add `WHERE Ident='Loan1'` to `CTE_Idents` and `CTE_Pairs`) and run only `CTE_Recursive`. Its result would be a subgraph (a loop) of the main graph. Save this intermediary result in a separate table. Pick another starting value that is not in this intermediary result and repeat. — Vladimir Baranov, Sep 01 '22 at 12:54
The recursive query is built to check and stop when the loop (subgraph) is detected. This part: `WHERE CTE_Recursive.IdentPath NOT LIKE CAST('%,' + CTE_Pairs.Ident2 + ',%' AS varchar(8000))` detects if a certain value appeared before and stops the recursion, so it should not run infinitely. — Vladimir Baranov, Sep 01 '22 at 12:55

Alexey · Answer 2 · 2022-08-29T15:42:44.780

You can achieve that you want using a query like this

WITH loans AS (
    SELECT 
        collateral_id,
        ',' + STRING_AGG(loan_id, ',') WITHIN GROUP (ORDER BY loan_id) + ',' AS loans,
        COUNT(loan_id) AS loan_count
    FROM loan_collateral_link
    GROUP BY collateral_id
),
loans_rn AS (
    SELECT
        *,
        DENSE_RANK() OVER (ORDER BY loan_count DESC) AS rn
    FROM loans
),
cte AS (
    SELECT 
        collateral_id, 
        loans, 
        loan_count, 
        collateral_id AS gid 
    FROM loans_rn 
    WHERE rn = 1
    UNION ALL
    SELECT 
        l.*, 
        cte.gid 
    FROM loans l
    JOIN cte ON l.loan_count < cte.loan_count 
                AND l.collateral_id != cte.collateral_id 
                AND cte.loans LIKE '%' + l.loans + '%'
),
collaterals AS (
    SELECT collateral_id, gid, loans FROM cte
    UNION ALL
    SELECT collateral_id, collateral_id, loans
    FROM loans WHERE NOT EXISTS (
        SELECT 1 
        FROM cte 
        WHERE cte.collateral_id = loans.collateral_id
    )
)
SELECT 
    item_id,
    'Group' + CAST(DENSE_RANK() OVER (ORDER BY gid) AS VARCHAR(MAX)) AS group_id
FROM (    
    SELECT 
        collateral_id AS item_id, 
        gid 
    FROM collaterals
    UNION ALL
    SELECT DISTINCT 
        l.loan_id, 
        (SELECT TOP 1 
             gid 
         FROM collaterals c 
         WHERE c.loans LIKE '%,'+l.loan_id + ',%')
    FROM loan_collateral_link l
) groups
ORDER BY group_id, item_id

Here I am aggregating the loan_id values for each collateral_id and getting a count of them (in loans CTE) so that I can select the largest groups later in the recursive CTE.

Then I use UNION ALL operator (in collaterals CTE) to get all the collateral_id values with corresponding gid value (which is a group id flag).

And finally, I use UNION ALL operator to include all the load_id values with corresponding gid value, which is used to generate group_id for each group.

Output

item_id	group_id
Col1	Group1
Col3	Group1
Loan1	Group1
Loan2	Group1
Loan5	Group1
Col2	Group2
Loan3	Group2
Loan4	Group2
Col80	Group3
Col81	Group3
Loan80	Group3
Loan81	Group3
Loan82	Group3

db<>fiddle here

Your query produces correct result for the given sample data, but I'm afraid, for a bit more complicated data it would generate incorrect groups. For example, try to add these rows to the sample data: ('Loan10', 'Col10'), ('Loan11', 'Col10'), ('Loan10', 'Col11'), ('Loan11', 'Col11'), ('Loan12', 'Col11'), They all should go into one new group. Your query creates two groups for them. Most likely your query works only for certain depths of the links and it can't work with arbitrary depths. — Vladimir Baranov, Aug 27 '22 at 12:24
I adjusted your fiddle with some extra data: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=a96f510db6974007ac24ea9f0c01053a — Vladimir Baranov, Aug 27 '22 at 12:35
@VladimirBaranov, I understand what you mean. I've updated my answer. — Alexey, Aug 28 '22 at 09:31
I added few more rows to the sample data. They essentially link everything together: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=b9f79517259548c0c3f9d0753dc823b2 I expect that the result should have only one group, but your new query produces 6 groups instead. — Vladimir Baranov, Aug 28 '22 at 12:20

Assign group_id to all linked records in many-to-many table

2 Answers2