
I have the following table:

create table dbo.Link
(
    FromNodeId int not null,
    ToNodeId int not null
)

Rows in this table represent links between nodes.

I want to prevent inserts or updates to this table from creating a cyclic relationship between nodes.

So if the table contains:

(1,2)
(2,3)

it should not be allowed to contain any of the following:

(1,1)
(2,1)
(3,1)

I'm happy to treat (1,1) separately (e.g. using a CHECK CONSTRAINT) if it makes the solution more straightforward.
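
For the self-referencing case I had in mind something like this (the constraint name is just for illustration):

alter table dbo.Link
    add constraint CK_Link_NoSelfReference check (FromNodeId <> ToNodeId);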

I was thinking of creating an AFTER INSERT trigger with a recursive CTE (though there may be an easier way to do it).

Assuming this is the way to go, what would the trigger definition be? If there is a more elegant way, what is it?

Jerry Nixon
fractor
  • Are you using the graph in an app layer like Java or C#? I think you should be enforcing these rules there. – Tim Biegeleisen Oct 11 '17 at 11:44
  • Thanks for the suggestion @TimBiegeleisen. Can you point me to a good resource that discusses this approach and details why this is better? – fractor Oct 11 '17 at 11:52
  • Preventing inserts/updates of `(1,1)` is easily done by a `CHECK CONSTRAINT` which blocks `FromNodeId=ToNodeId`. – TT. Oct 11 '17 at 11:56
  • Thanks @TT., yep happy to treat (1,1) specially if it makes it more straightforward. I'll update the question. – fractor Oct 11 '17 at 12:00
  • @TT. But how to make this dynamic? Fractor, I would view the database just as a means of storing the data, while you have a program somewhere maintaining the graph in memory as a collection of objects. – Tim Biegeleisen Oct 11 '17 at 12:02
  • @TimBiegeleisen I follow you there, and in almost all cases I'd say do the validation somewhere else. – TT. Oct 11 '17 at 12:04
  • @TimBiegeleisen not sure what you mean by dynamic in this sense. I am aware of the approach of an app layer managing data integrity but have never been fully persuaded by it. I am happy to reconsider though would need help in understanding the benefits. Is there a resource you can recommend? – fractor Oct 11 '17 at 12:07
  • Install SQL Server 2017 and make it a [graph database](https://learn.microsoft.com/sql/relational-databases/graphs/sql-graph-overview). (Half kidding -- if you have access to it, give it a try though, because the difficulties in implementing this stuff in an RDBMS are exactly why the feature exists.) – Jeroen Mostert Oct 11 '17 at 12:08
  • @JeroenMostert Does it have cyclic reference detection though? – TT. Oct 11 '17 at 12:09
  • @TT.: I have no idea, I haven't tried it myself yet, at all. I expect that, at the very least, querying for a cycle would be possible to do efficiently, even if you couldn't directly add a constraint. Disregarding cycles, the benefits for other kinds of graph queries might be a more convincing reason to use it (if that's relevant for the OP's scenario). – Jeroen Mostert Oct 11 '17 at 12:11
  • @JeroenMostert actually it is a SQL Server 2017 graph database. As far as I know this does not support arbitrary depth match clauses of the form MATCH(Node1-(Link)*->Node2). – fractor Oct 11 '17 at 12:11
  • Well, technically, neither does any kind of recursive CTE, unless you crank the recursion limit up to infinity and hope the graph isn't willfully malicious with an extra long cycle. Efficiently checking for a cycle of any length is possible with sequential code, of course, but then you get into yucky cursors. Win some, lose some. – Jeroen Mostert Oct 11 '17 at 12:13
  • As an aside, it is possible to enforce the absence of cycles by having a check of the form `FromNodeId < ToNodeId`, *if* you are OK with limiting the freedom of how the graph is built (effectively, you're forcing the user to do a topological sort before they store the graph, and maintain it afterwards). This isn't a universal solution, but I've used it once in a database where this restriction was acceptable; a sketch of this check appears after these comments. – Jeroen Mostert Oct 11 '17 at 12:49
  • Depending on the actual data, there is no performant way to do this in a relational database. The only (possibly) performant way to do this with compiled code would be to maintain the entire node graph in memory. And even then it's questionable how performant it could be. You are trying to do the equivalent of a deadlock search on every write and this has so much overhead that even Operating Systems and DBMS's will only do it after a lock has been blocked for some arbitrary period (an option not available to you since your nodes aren't actual locks). – RBarryYoung Oct 11 '17 at 13:16
  • @RBarryYoung I think you are wrong; it depends on what you consider "performant". If the table is read thousands of times per second, every operation will be too long, but if the table is locked for 1-100 ms it could be considered performant for many scenarios. The frequency of new insertions and the "density" of the graph will also affect what performance is acceptable. I have tested my solution on a table with 10k rows and inserting new valid links is very fast. – MtwStark Oct 13 '17 at 10:53
  • @JeroenMostert I have written a function to test the validity of new links without cursors, and I think its efficiency is good. Please take a look at my answer. – MtwStark Oct 13 '17 at 10:59
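
A minimal sketch of the ordering-based check mentioned in the comments above (the constraint name is illustrative; it only works if node ids can be assigned so that every link goes from a smaller id to a larger one, i.e. a topological numbering):

alter table dbo.Link
    add constraint CK_Link_TopologicalOrder check (FromNodeId < ToNodeId);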

2 Answers


Note first that it is preferable to detect cycles in another environment: recursive CTEs aren't known for their good performance, and neither is a trigger that runs for every insert statement. For large graphs, the approach below will likely be inefficient.


Suppose you create the table as follows:

CREATE TABLE dbo.lnk (
    node_from INT NOT NULL,
    node_to INT NOT NULL,
    CONSTRAINT CHK_self_link CHECK (node_from<>node_to),
    CONSTRAINT PK_lnk_node_from_node_to PRIMARY KEY(node_from,node_to)
);

That blocks inserts where node_from equals node_to (via the check constraint) as well as rows that already exist (via the primary key).

The following trigger detects cyclic references and throws an exception when one is found:

CREATE TRIGGER TRG_no_circulars_on_lnk ON dbo.lnk AFTER INSERT
AS
BEGIN
    DECLARE @cd INT;
    WITH det_path AS (
        -- anchor member: take the first hop from each newly inserted edge
        SELECT
            anchor=i.node_from,
            node_to=l.node_to,
            is_cycle=CASE WHEN i.node_from/*anchor*/=l.node_to THEN 1 ELSE 0 END
        FROM
            inserted AS i
            INNER JOIN dbo.lnk AS l ON
                l.node_from=i.node_to
        UNION ALL
        -- recursive member: keep following outgoing edges until we arrive back at the anchor
        SELECT
            dp.anchor,
            node_to=l.node_to,
            is_cycle=CASE WHEN dp.anchor=l.node_to THEN 1 ELSE 0 END
        FROM
            det_path AS dp
            INNER JOIN dbo.lnk AS l ON
                l.node_from=dp.node_to
        WHERE
            dp.is_cycle=0
    )
    -- any row with is_cycle=1 means an inserted edge closes a cycle
    SELECT TOP 1
        @cd=is_cycle 
    FROM 
        det_path
    WHERE
        is_cycle=1
    OPTION 
        (MAXRECURSION 0);

    IF @cd IS NOT NULL 
        THROW 67890, 'Insert would cause cyclic reference', 1;
END

I tested this for a limited number of inserts.

INSERT INTO dbo.lnk(node_from,node_to)VALUES(1,2); -- OK
INSERT INTO dbo.lnk(node_from,node_to)VALUES(2,3); -- OK
INSERT INTO dbo.lnk(node_from,node_to)VALUES(3,4); -- OK

And

INSERT INTO dbo.lnk(node_from,node_to)VALUES(2,3); -- PK violation
INSERT INTO dbo.lnk(node_from,node_to)VALUES(1,1); -- Check constraint violation
INSERT INTO dbo.lnk(node_from,node_to)VALUES(3,2); -- Exception: Insert would cause cyclic reference
INSERT INTO dbo.lnk(node_from,node_to)VALUES(3,1); -- Exception: Insert would cause cyclic reference
INSERT INTO dbo.lnk(node_from,node_to)VALUES(4,1); -- Exception: Insert would cause cyclic reference

It also detects cyclic references already present in the inserted rows when more than one row is inserted at once, or when a path longer than one edge would be introduced into the graph. Continuing from the same initial inserts:

INSERT INTO dbo.lnk(node_from,node_to)VALUES(8,9),(9,8);       -- Exception: Insert would cause cyclic reference
INSERT INTO dbo.lnk(node_from,node_to)VALUES(4,5),(5,6),(6,1); -- Exception: Insert would cause cyclic reference
TT.
  • While this works fine in the technical sense, the overhead of doing a cycle search for every write (even with the most efficient compiled code) can be severe, and a trigger is one of the worst places to put that much overhead. – RBarryYoung Oct 11 '17 at 13:20
  • @RBarryYoung Sure, the last sentence is a warning just for that. The answer is for sports :-). Perhaps the SQL Server team will include cyclic reference detection for their new graph objects. – TT. Oct 11 '17 at 13:23
  • Thank you. This suits my needs right now. Should performance prove an issue I will consider one or more of the following: stored procedure (not sure this would improve performance), separate transitive closure table (results in duplicate data), preemptive validation check in client process (requires management of potential race conditions possibly necessitating single process or data locking). – fractor Oct 12 '17 at 08:46
  • @fractor If a solution proves too slow using recursive CTEs, a solution that does not use recursive CTEs might still cut it... until that isn't fast enough anymore. T-SQL is probably not the best environment to do this stuff. I'm not too fond of MtwStark's answer though... that could be done better IMO. The recursive CTE way is a lot easier to write though. – TT. Oct 12 '17 at 08:50
  • I'm secretly hoping the SQL Server team will add support for arbitrary length paths similar to that available in neo4j. Will still have to review performance but it would be a neat solution. – fractor Oct 12 '17 at 09:18
  • My solution is very fast and reliable; the CTE is not, it is only easier to write. I have updated it to handle multi-row inserts and described it a bit to help the reading. In my opinion, the T-SQL environment IS the right place for data integrity and validity checks. If you are talking about a 10-row table, you can code your solution wherever you want, even on paper.. but if your table grows to millions of rows, you can't expect to load it into memory, try some crappy high-level coding and hope it will be efficient.. – MtwStark Oct 13 '17 at 11:12
  • @MtwStark Yup, only for small graphs should one use the CTE approach; for average to large graphs one should steer clear of CTEs. Your approach would be better then. Still, if I were to drop the CTE approach, I think I'd do it with visitor stacks and mimic recursion. Also, be careful with table variables... SQL Server doesn't take their cardinalities into account when generating execution plans and will estimate them as 1. That can work OK if the number of rows is small enough, but if the table variable can contain a lot of rows, you'd better switch to temporary tables. – TT. Oct 13 '17 at 15:25

EDIT: handles multi-record inserts; the logic has been moved into a separate function

I have taken a procedural approach: it is very fast and almost independent of the number of records in the link table and of the graph's "density".

I have tested it on a table with 10,000 links, with node values from 1 to 1000. It is really fast and does not suffer from the link table's size or "density".

In addition, the function can be used to test values before the insert, or (for example) if you don't want to use a trigger at all and prefer to move the test logic to the client.
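
For instance, a pre-insert check from a stored procedure or client batch could look something like the following (the values are only an illustration, and the function itself is defined further down; note that a check done before the insert does not account for concurrent writers, as mentioned in the comments above):

IF dbo.fn_test_link(3, 1) = 1
BEGIN
    INSERT INTO dbo.Link (FromNodeId, ToNodeId) VALUES (3, 1);
END
ELSE
BEGIN
    PRINT 'Rejected: link (3,1) would create a circular reference';
END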

A consideration on the recursive CTE: BE CAREFUL!
I have tested the accepted answer on my test table (10k rows), but after 25 minutes I cancelled the insert of one single row because the query was still hanging with no result. Downsizing the table to 5k rows, the insert of a single record can take up to 2-3 minutes. It is very dependent on the "population" of the graph: if you insert a new path, or add a node to a path with little "ramification", it is quite fast, but you have no control over that. When the graph becomes more "dense", this solution will blow up in your face.

Consider your needs very carefully.

So, let's see how to do it.

First of all, I have made the primary key of the table cover both columns and added an index on the second column for full coverage. (The CHECK constraint on FromNodeId<>ToNodeId is not needed because the algorithm already covers this case.)

CREATE TABLE [dbo].[Link](
    [FromNodeId] [int] NOT NULL,
    [ToNodeId] [int] NOT NULL,
 CONSTRAINT [PK_Link] PRIMARY KEY CLUSTERED ([FromNodeId],[ToNodeId])
) 
GO

CREATE NONCLUSTERED INDEX [ToNodeId] ON [dbo].[Link] ([ToNodeId])
GO

Then I have built a function to test the validity of a single link:

drop function fn_test_link
go
create function fn_test_link(@f int, @t int)
returns int
as
begin
    --SET NOCOUNT ON

    declare @p table (id int identity primary key, l int, t int, unique (l,t,id)) -- l = step level, t = node reached at that level
    declare @r int = 0
    declare @i int = 0

    -- link is not self-referencing
    if @f<>@t begin 
        -- there are links that start from where the new link wants to end (possible cycle)
        if exists(select 1 from link where fromnodeid=@t) begin

            -- PAY ATTENTION: HERE THE LINK TABLE ALREADY CONTAINS ALL RECORDS (INCLUDING THE NEW ONES, IF THE FUNCTION IS CALLED FROM AN AFTER INSERT TRIGGER)

            -- LOAD ALL THE NODES REACHABLE IN ONE STEP FROM THE DESTINATION OF THE TESTED LINK
            set @i = 0
            insert into @p 
            select distinct @i, ToNodeId 
            from link 
            where fromnodeid=@t

            set @i = 1
            -- THERE IS AT LEAST ONE MORE STEP TO FOLLOW DOWN THE PATHS
            while exists(select 1 from @p where l=@i-1) begin

                -- LOAD THE NEXT STEP FOR ALL THE PATHS TOUCHED
                insert into @p 
                select distinct @i, l.ToNodeId
                from link l
                join @p p on p.l = @i-1 and p.t = l.fromnodeid

                -- CHECK IF THIS STEP HAS REACHED THE START NODE OF THE TESTED LINK
                if exists(select 1 from @p where l=@i and t=@f) begin
                    -- WE ARE EATING OUR OWN TAIL! CIRCULAR REFERENCE FOUND
                    set @r = -1
                    break
                end

                -- THE LINK IS STILL GOOD
                -- DELETE FROM THE LIST NODES THAT WERE ALREADY REACHED AT AN EARLIER STEP
                -- (THIS IS A BIG OPTIMIZATION: WHEN PATHS CROSS EACH OTHER YOU RISK TESTING THE SAME PATHS MANY TIMES)
                delete p 
                from @p p 
                where l = @i 
                and (exists(select 1 from @p px where px.l < p.l and px.t = p.t)) 

                set @i = @i + 1
            end
            if @r<0
                -- a circular reference was found: the link is invalid
                set @r = 0
            else
                -- no circular reference was found: the link is valid
                set @r = 1

        end else begin 
            -- THERE ARE NO LINKS THAT START FROM THE TESTED LINK'S DESTINATION (A CIRCULAR REFERENCE IS NOT POSSIBLE)
            set @r = 1
        end
    end; -- link is not self-referencing

    --select * from @p 

    return @r

end
GO

Now let's call it from a trigger.
If more than one row is inserted, the trigger tests each link against the whole insert (old table + new records): if all of them are valid and the final table is consistent, the insert completes; if any one of them is not valid, the insert is aborted.

DROP TRIGGER tr_test_circular_reference 
GO
CREATE TRIGGER tr_test_circular_reference ON link AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON

    declare @f int = 0
    declare @t int = 0

    declare @n int = 0
    declare @i int = 1

    declare @ins table (id int identity primary key, f int, t int)

    insert into @ins select * from inserted     

    set @n = @@ROWCOUNT;

    -- there are links to insert
    while @i<=@n begin

        -- load link
        select @f=f, @t=t from @ins where id = @i

        if dbo.fn_test_link(@f, @t)=0 begin
            declare @m nvarchar(255)                    
            set @m = formatmessage('Insertion of link (%d,%d) would cause circular reference (n.%d)', @f, @t, @i);
            THROW 50000, @m, 1
        end

        set @i = @i + 1
    end

END
GO
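
For illustration, a quick smoke test of the trigger might look like this, assuming an empty Link table (the values are only examples; the last two statements should be rejected):

INSERT INTO dbo.Link (FromNodeId, ToNodeId) VALUES (1, 2), (2, 3); -- OK
INSERT INTO dbo.Link (FromNodeId, ToNodeId) VALUES (3, 1);         -- rejected: circular reference
INSERT INTO dbo.Link (FromNodeId, ToNodeId) VALUES (6, 7), (7, 6); -- rejected: circular reference within the same insert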

I hope this helps.

MtwStark
  • If you go this way, I'd prohibit direct inserts on the table and demand you call a stored procedure for inserting edges that incorporates this logic instead. A trigger that prohibits multiple rows is ugly; a trigger that allows it but only by looping a loop would be an unfairly hidden performance drag. I mean, just try doing this in SSMS for a big table and forget to `SET NOCOUNT ON`... which reminds me, your trigger is missing `SET NOCOUNT ON`. Usually it's just a nicety to do this, but if you're running an arbitrary number of statements it's essential. (A minimal sketch of such a stored procedure follows this comment thread.) – Jeroen Mostert Oct 11 '17 at 13:36
  • Why? It is a trigger! You don't have to call any procedure.. multi-row insert is easy to enable, I was just too lazy to add the outer loop.. OK for the `SET NOCOUNT ON` but you can add it.. this is not the problem, it is an optimization.. come on – MtwStark Oct 11 '17 at 13:40
  • If your graph is big, and you've got a table of, say, 100.000 rows, and you do an `INSERT` in your SSMS window, sit back and watch the fireworks as SSMS dutifully prints all those `1 row(s) affected.` messages from the trigger statements to the output window (and allocates memory to hold them all). It's not pretty, and it's also slow. Yes, it's an optimization -- an actually useful one, though. (It's *also* a correctness issue if your application actually *uses* the row count, as this will be affected by the trigger, but SSMS freaking out was the first thing that came to mind.) – Jeroen Mostert Oct 11 '17 at 13:44
  • `SET NOCOUNT ON` added – MtwStark Oct 11 '17 at 13:47
  • The reason I'd make a stored procedure, incidentally, also has to do with SSMS: if you're an innocent end user and you, say, type or copy-paste a bunch of rows into a table, you expect this to be a simple, fast operation, right? Under the hood, these will just be separate inserts, and they'll invoke your trigger every time, and the larger the graph is, the slower that thing gets as it checks for cycles. And you can't cancel once it starts inserting... Now of course, a reasonable response is "don't copy-paste stuff with SSMS, then", but personally I just prefer expensive things to be explicit. – Jeroen Mostert Oct 11 '17 at 13:52
  • If you want to be sure not to introduce a cyclic reference, you NEED to test the rows one by one, assuming an order of insertion.. – MtwStark Oct 11 '17 at 13:54
  • I'm not actually sure that's true, but I'm not even going to try my hand at an algorithm in T-SQL to do multiple DFS searches efficiently. :-) By the time you get to optimizing that kind of thing, you're hopefully moving the graph logic client side anyway. – Jeroen Mostert Oct 11 '17 at 14:02
  • It's easy to see.. if you insert two rows, one dependent on the other, both look fine against the initial table, but the state of the table after this bulk insert will be broken; for example `insert values (6,7),(7,6)` – MtwStark Oct 11 '17 at 14:10
  • That doesn't mean there can't be an algorithm that, given a graph and multiple edges, detects whether adding those edges to the graph results in a cycle, more efficiently than looping a particular check over each of the edges added (your particular example illustrates this in that *we don't even need to know what the graph looks like*). But this probably also depends on how big the graph is, sparseness, how many edges are being added... Like I said, though, even if there is such an algorithm there's little point in translating it to T-SQL, so not really worth thinking about deeply here. – Jeroen Mostert Oct 11 '17 at 14:18
  • This is exactly the kind of code that you *should not* put into a trigger. Having every DML statement to a table block for an unknown and highly variable amount of time, without full transaction control and with limitations on exception-handling and error-reporting, as well as having to handle multi-row issues (which you do by aborting the insert and any on-going transaction), is a bad idea for good database design and administration. – RBarryYoung Oct 11 '17 at 17:15
  • Answer updated for multi-row insert and some optimization.. the validity check of a new row against a 40k-row table takes about 5 ms in most cases, 25-50 ms in a few cases, and at most 150 ms in 0.02% of cases.. – MtwStark Oct 13 '17 at 11:42
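
A minimal sketch of the stored-procedure alternative suggested in the comments above, reusing fn_test_link from this answer (the procedure name and error number are illustrative, and concurrency between callers is not addressed):

CREATE PROCEDURE dbo.usp_AddLink
    @FromNodeId INT,
    @ToNodeId   INT
AS
BEGIN
    SET NOCOUNT ON;

    -- validate before touching the table; fn_test_link returns 1 when the link is safe to add
    IF dbo.fn_test_link(@FromNodeId, @ToNodeId) = 0
        THROW 50001, 'Link would create a circular reference', 1;

    INSERT INTO dbo.Link (FromNodeId, ToNodeId)
    VALUES (@FromNodeId, @ToNodeId);
END
GO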