Query to fetch all referenced entities recursively

Question

I have a datamodels which consists of 'Claims' which (to make things simple for stackoverflow) only has an OpenAmount field. There are two other tables, 'ClaimCoupling' and 'ClaimEntryReference'.

The ClaimCoupling table directly references back to the Claim table and the ClaimEntryReference is effectively the booking of a received amount that can be booked over multiple claims (See ClaimEntry_ID). See this diagram;

For simplicity I've removed all amounts as that's not what I am currently struggling with.

What I want is a query that will start @ the Claim table, and fetches all a claim with an OpenAmount which is <> 0. However I want to be able to print out an accurate report of how this OpenAmount came to be, which means I'll need to also print out any Claims coupled to this claim. To make it even more interesting the same thing applies to the bookings, if a booking was made on claim X and claim Y and only X has an open amount I want to fetch both X and Y so I can then show the payment which was booked as a whole.

I've attempted to do this with a recursive CTE but this (rightfully) blows up on the circulair references. I figured I'd fix that with a simple where statement where I would say only recursively add records which are not yet part of CTE but this is not allowed....

    WITH coupledClaims AS (
    --Get all unique combinations 
    SELECT cc.SubstractedFromClaim_ID AS Claim_ID,
           cc.AddedToClaim_ID AS Linked_Claim_ID FROM dbo.ClaimCoupling cc
    UNION
    SELECT cc.AddedToClaim_ID AS Claim_ID,
           cc.SubstractedFromClaim_ID AS Linked_Claim_ID FROM dbo.ClaimCoupling cc
),
MyClaims as
(
  SELECT * FROM Claim WHERE OpenAmount <> 0
  UNION ALL
  SELECT c.* FROM coupledClaims JOIN MyClaims mc ON coupledClaims.claim_id = mc.ID JOIN claim c ON c.ID = coupledClaims.linked_Claim_ID
  WHERE c.ID NOT IN (SELECT ID FROM MyClaims)
)
SELECT * FROM MyClaims

After fiddling around with that for way too long I decided I'd do it with an actual loop... @@Rowcount and simply manually add them to a table variable but as I was writing this solution (which I'm sure I can get to work) I figured I'd ask here first because I don't like writing loops in TSQL as I always feel it's ugly and inefficient.

See the following sql Fiddle for the data models and some test data (I commented out the recursive part as otherwise I was not allowed to create a link);

http://sqlfiddle.com/#!6/129ad5/7/0

I'm hoping someone here will have a great way of handling this problem (likely I'm doing something wrong with the recursive CTE). For completion this is done on MS SQL 2016.

One way of detecting and handling loops in the data is [here](https://stackoverflow.com/questions/15080922/infinite-loop-cte-with-option-maxrecursion-0/15081353#15081353). As you recurse through the data each row keeps track of the path that was explored to reach it. Any new row that replicates an element already on the path is ignored. — HABO, Jul 26 '17 at 15:56
That's pretty clever way of handling the loop to be honest. I might spend some time putting this in and then checking the performance of the two solutions as this does actually loop 'once' before detecting it if I'm not mistaken. — F.B. ten Kate, Jul 27 '17 at 06:14
Okay, so I rebuild the query in the way you suggested and found this to be performance wise a slightly worse way of doing it that doing the recursion yourself. This is likely due to the casting of an ID to varchar and then concatenating the strings. To be fair it's a pain to analyse due to it being a lot of different statements and I have been unable to get the IO/CPU statistics for the entire query (rather than statement per statement) — F.B. ten Kate, Jul 27 '17 at 07:46

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

So here is what I've learned and done so far. Thanks to the comment of habo which refers to the following question; Infinite loop in CTE when parsing self-referencing table

Firstly I decided to at least 'solve' my problem and wrote some manual recursion, this solves my problem but is not as 'pretty' as the CTE solution which I was hoping/thinking would be easier to read as well as out perform the manual recursion solution.

Manual Recursion

/****************************/
/* CLAIMS AND PAYMENT LOGIC */
/****************************/
DECLARE @rows as INT = 0
DECLARE @relevantClaimIds as Table(
Debtor_ID INT,
Claim_ID int
)
SET NOCOUNT ON

--Get anchor condition
INSERT INTO @relevantClaimIds (Debtor_ID, Claim_ID)
select Debtor_ID, ID
from Claim c
WHERE OpenAmount <> 0

--Do recursion
WHILE @rows <> (SELECT COUNT(*) FROM @relevantClaimIds)
BEGIN
set @rows = (SELECT COUNT(*) FROM @relevantClaimIds)

--Subtracted
INSERT @relevantClaimIds (Debtor_ID, Claim_ID)
SELECT DISTINCT c.Debtor_ID, c.id
FROM claim c
inner join claimcoupling cc on cc.SubstractedFromClaim_ID = c.ID
JOIN @relevantClaimIds rci on rci.Claim_ID = cc.AddedToClaim_ID
--might be multiple paths to this recursion so eliminate duplicates
left join @relevantClaimIds dup on dup.Claim_ID = c.id
WHERE dup.Claim_ID is null

--Added
INSERT @relevantClaimIds (Debtor_ID, Claim_ID)
SELECT DISTINCT c.Debtor_ID, c.id
FROM claim c
inner join claimcoupling cc on cc.AddedToClaim_ID = c.ID
JOIN @relevantClaimIds rci on rci.Claim_ID = cc.SubstractedFromClaim_ID
--might be multiple paths to this recursion so eliminate duplicates
left join @relevantClaimIds dup on dup.Claim_ID = c.id
WHERE dup.Claim_ID is null

--Payments
INSERT @relevantClaimIds (Debtor_ID, Claim_ID)
SELECT DISTINCT c.Debtor_ID, c.id
FROM @relevantClaimIds f
join ClaimEntryReference cer on f.Claim_ID = cer.Claim_ID
JOIN ClaimEntryReference cer_linked on cer.ClaimEntry_ID = cer_linked.ClaimEntry_ID AND cer.ID <> cer_linked.ID
JOIN Claim c on c.ID = cer_linked.Claim_ID
--might be multiple paths to this recursion so eliminate duplicates
left join @relevantClaimIds dup on dup.Claim_ID = c.id
WHERE dup.Claim_ID is null
END

Then after I received and read the comment I decided to try the CTE solution which looks like this;

CTE Recursion

with Tree as
        (
        select Debtor_ID, ID AS Claim_ID, CAST(ID AS VARCHAR(MAX)) AS levels
        from Claim c
        WHERE OpenAmount <> 0

        UNION ALL
        SELECT c.Debtor_ID, c.id, t.levels + ',' + CAST(c.ID AS VARCHAR(MAX)) AS levels
        FROM claim c
        inner join claimcoupling cc on cc.SubstractedFromClaim_ID = c.ID
        JOIN Tree t on t.Claim_ID = cc.AddedToClaim_ID
        WHERE (','+T.levels+',' not like '%,'+cast(c.ID as varchar(max))+',%')

        UNION ALL
        SELECT c.Debtor_ID, c.id, t.levels + ',' + CAST(c.ID AS VARCHAR(MAX)) AS levels
        FROM claim c
        inner join claimcoupling cc on cc.AddedToClaim_ID = c.ID
        JOIN Tree t on t.Claim_ID = cc.SubstractedFromClaim_ID
        WHERE (','+T.levels+',' not like '%,'+cast(c.ID as varchar(max))+',%')

        UNION ALL
        SELECT c.Debtor_ID, c.id, t.levels + ',' + CAST(c.ID AS VARCHAR(MAX)) AS levels
        FROM Tree t
        join ClaimEntryReference cer on t.Claim_ID = cer.Claim_ID
        JOIN ClaimEntryReference cer_linked on cer.ClaimEntry_ID = cer_linked.ClaimEntry_ID AND cer.ID <> cer_linked.ID
        JOIN Claim c on c.ID = cer_linked.Claim_ID
        WHERE (','+T.levels+',' not like '%,'+cast(c.ID as varchar(max))+',%')
        )
select  DISTINCT Tree.Debtor_ID, Tree.Claim_ID
from Tree

This solution is indeed a lot 'shorter' and easier on the eyes but does it actually perform better?

Performance differences

Manual; CPU 16, Reads 1793, Duration 13

CTE; CPU 47, Reads 4001, Duration 48

Conclusion

Not sure if it's due to the varchar cast that is required in the CTE solution or that it has to do one extra iteration before completing it's recursion but it actually requires more resources on all fronts than the manual recursion.

In the end it is possible with CTE however looks aren't everything (thank god ;-)) performance wise sticking with the manual recursion seems like a better route.

For giggles it may be worth running the benchmark again using `VarChar(8000)` (or some other suitable size) rather than `VarChar(max)`. The database engine handles them quite differently. — HABO, Jul 27 '17 at 15:56
For the lulz I did, sadly, 8000 is the only suitable size. If I set it to anything lower you get the following message; 'Types don't match between the anchor and the recursive part in column "levels" of recursive query.'. When put at 8000 though the performance metrics are exactly the same. — F.B. ten Kate, Jul 28 '17 at 06:22

Query to fetch all referenced entities recursively

1 Answers1