6

I've got a SQL server table in which each row represents an edge in a graph network. The FromNodeID and ToNodeID are foreign keys to a node table, and the schema is something like this:

CREATE TABLE #Edges (
  EdgeID int identity (1,1),
  FromNodeID int,
  ToNodeID int
  );

INSERT INTO #Edges (FromNodeID, ToNodeID) VALUES
  (1,2),
  (1,3),
  (1,4),
  (2,3),
  (3,5),
  (4,5),
  (5,6);

Now, if I consider each edge to be directed (i.e., one way), then it's easy to work out all those nodes that I can get to directly from any node. I'd add an index to the FromNodeID column, and then run a query like this:

SELECT ToNodeID FROM #Edges WHERE FromNodeID = 3

Result: 5

But what would be the best way to structure my table/query if I want to treat each edge as unidirectional. i.e. starting from node 3, I'd like to get the results:

Result: 1, 2, 5

The simplest way I can think of would be to add an additional index to the ToNodeID column and then run a query like this:

SELECT ToNodeID FROM #Edges WHERE FromNodeID = 3 
UNION SELECT FromNodeID FROM #Edges WHERE ToNodeID = 3;

But this obviously involves combining result sets from two queries and doesn't seem very efficient - is there a better way to write this in a single query? (Note that I don't want to insert the reversed edges again into the table - I need to be able to treat the edges as either directed or undirected at runtime).

Thanks for any advice!

Jerry Nixon
  • 31,313
  • 14
  • 117
  • 233
Clare Nia
  • 105
  • 1
  • 3
  • If `#Edges` is secured from cases with FromNodeID = ToNodeID, your UNION version would win from using `UNION ALL` instead of `UNION`. And even if the self-referential nodes are allowed, you would be better off using `SELECT ... WHERE FromNodeID = 3 AND ToNodeID <> 3 UNION ALL SELECT ... WHERE FromNodeID <> 3 AND ToNodeID = 3 UNION ALL SELECT 3 FROM #Edges WHERE FromNodeID = 3 AND ToNodeID = 3`, but only if you do not need the nodes to be sorted (otherwise it appears to have worse performance than your version). – Andriy M Jan 27 '11 at 09:27

3 Answers3

4

But this obviously involves combining result sets from two queries and doesn't seem very efficient - is there a better way to write this in a single query?

This is efficient enough.

You could do this:

SELECT  CASE 3 WHEN FromNodeId THEN ToNodeId ELSE FromNodeId END
FROM    Edges
WHERE   3 IN (FromNodeId, ToNodeId)

but this will be essentially the same (will UNION these indexes under the hood).

Here's a script to test:

CREATE TABLE #Edges
        (
        EdgeID INT IDENTITY (1,1) PRIMARY KEY,
        FromNodeID int NOT NULL,
        ToNodeID int NOT NULL
        )
CREATE INDEX ix_edges_from ON #Edges (FromNodeID, ToNodeId)
CREATE INDEX ix_edges_to ON #Edges (ToNodeID, FromNodeId)
;
WITH    q (rn) AS
        (
        SELECT  1
        UNION ALL
        SELECT  rn + 1
        FROM    q
        WHERE   rn < 1000
        )
INSERT
INTO    #Edges (FromNodeId, ToNodeId)
SELECT  q1.rn, q2.rn
FROM    q q1
CROSS JOIN
        q q2
WHERE   (q1.rn + q2.rn) % 37 = 0
OPTION (MAXRECURSION 0)

For the UNION:

SELECT  ToNodeId
FROM    #Edges
WHERE   FromNodeId = 3
UNION
SELECT  FromNodeId
FROM    #Edges
WHERE   ToNodeId = 3


  |--Stream Aggregate(GROUP BY:([Union1006]))
       |--Merge Join(Concatenation)
            |--Index Seek(OBJECT:([tempdb].[dbo].[#Edges]), SEEK:([tempdb].[dbo].[#Edges].[FromNodeID]=(3)) ORDERED FORWARD)
            |--Index Seek(OBJECT:([tempdb].[dbo].[#Edges]), SEEK:([tempdb].[dbo].[#Edges].[ToNodeID]=(3)) ORDERED FORWARD)

For the IN:

  |--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN (3)=[tempdb].[dbo].[#Edges].[FromNodeID] THEN [tempdb].[dbo].[#Edges].[ToNodeID] ELSE [tempdb].[dbo].[#Edges].[FromNodeID] END))
       |--Sort(DISTINCT ORDER BY:([tempdb].[dbo].[#Edges].[EdgeID] ASC))
            |--Concatenation
                 |--Index Seek(OBJECT:([tempdb].[dbo].[#Edges]), SEEK:([tempdb].[dbo].[#Edges].[FromNodeID]=(3)) ORDERED FORWARD)
                 |--Index Seek(OBJECT:([tempdb].[dbo].[#Edges]), SEEK:([tempdb].[dbo].[#Edges].[ToNodeID]=(3)) ORDERED FORWARD)

As you can see, the plans are essentially the same: they both take values from the corresponding indexes and concatenate the results.

The UNION query is in fact a little more efficient, since it uses a Merge Join to concatenate the results, and the records come out of the merge join naturally ordered, so the Stream Aggregate does not need to sort.

Quassnoi
  • 413,100
  • 91
  • 616
  • 614
1

Must you process the graph directly from SQL Server? If you are really concerned about performance, you should use one of the data-structures specifically for representing and processing graphs. Most of the work I have done with graphs (and I have done plenty) would have been infeasible if I used a generic database backend to consult the graphs.

One of the most effective representations that I have used is described in the appendices of a compiler book I have: Engineering a Compiler, by Keith Cooper and Linda Torczon.

Kevin A. Naudé
  • 3,992
  • 19
  • 20
0

There are three options I can think of: do it only in the table, only in the queries, or create a view. For the table, create triggers that enforce the symmetric closure (e.g. when inserting (a,b), also insert (b,a); when updating (a,b) to (c,d), delete the old symmetry-preserving (b,a) pair, then insert (d,c)). Note that this may not work, as some RDBMSs (I'm not sure if SQL Server is one) don't allow insertions/updates to the table that a trigger fired on.

In queries,

SELECT CASE FromNodeID WHEN 3 THEN ToNodeId ELSE FromNodeId END
  FROM #Edges 
    WHERE FromNodeID=3 OR ToNodeID=3

For the view, create one that's the symmetric closure of the original table. I think you'd still have to use a UNION, but it could simplify query writing.

CREATE VIEW undirected_edges (FirstEdgeID,SecondEdgeId)
  AS (SELECT FromNodeID, ToNodeID FROM #Edges)
  UNION DISTINCT
    (SELECT ToNodeID, FromNodeID FROM #Edges)
outis
  • 75,655
  • 22
  • 151
  • 221