5

I have a table consisting of about 70,000 rows and two columns (both VARCHAR(16)): id and parent_id.

I'd like to populate a 'depth' column that shows how far a particular record is from the 'root' node.

e.g.

id,parent_id,depth
A,NULL,0
B,A,1
C,A,1
D,B,2
E,D,3

etc.

I started by writing a query based on this answer to a similar question:

WITH myCTE(id, depth) AS
(
    SELECT id, 0 FROM objects where id = 'A'
    UNION ALL
    SELECT objects.id, depth + 1 FROM myCTE JOIN objects ON objects.parent_id = myCTE.id
)
SELECT id, depth FROM myCTE

With my dataset (~80,000 rows) the above takes almost two hours to execute!

I then wrote my query as a loop and got far better performance:

ALTER TABLE objects ADD depth INT NULL
DECLARE @counter int
DECLARE @total int
SET @counter = 0
UPDATE objects SET depth = 0 WHERE id = 'A'

SELECT @total = COUNT(*) FROM objects WHERE depth IS NULL

WHILE (@total > 0)
BEGIN
    UPDATE objects SET depth = @counter + 1 WHERE parent_id IN (
        SELECT id FROM objects WHERE depth = @counter
    )
    SELECT @total = COUNT(*) FROM objects WHERE depth IS NULL
    SET @counter = @counter + 1
END

The above code only takes a couple of minutes (and it has the benefit of adding the results to the existing table)

My question is whether my results are typical of using a CTE for this problem or whether there is something I have overlooked that might explain it? Indexes, maybe? (I have none on the table right now)

Community
  • 1
  • 1
Catchwa
  • 5,845
  • 4
  • 31
  • 57
  • Wow. In my experience, that sounds pretty non-typical. Have to turned on execution plans to see a comparison between the two? – Matt Feb 05 '13 at 11:57
  • 1
    @Matt - It is critical for even moderately large tables that the recursive part of the CTE be able to be satisfied by an index seek or [Performance can degrade horribly](http://dba.stackexchange.com/q/15596/3690) – Martin Smith Feb 05 '13 at 12:27

2 Answers2

8

You would need an index on parent_id. The recursive part of a CTE will always use a nested loops join and is impervious to join hints (Results are added to a stack spool and the rows are processed one by one in LIFO order)

Without an index on parent_id it will need to scan the table multiple times on the inner side of the nested loops. Performance will degrade exponentially with number of rows.

Your query without recursion will be able to use different join types (hash or merge) that only scan the table twice for each level of recursion. Most likely hash join in this case as you have no useful indexes that would avoid a sort.

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
0

Have you considered using the HierarchyID datatype? It would make your life so much easier.

CREATE TABLE Groups.tblHierarchyNode
(
        NodeID              Int IDENTITY (0,1),
        NodeHID             HierarchyID NOT NULL,   -- DB Hierarchy ID of where I am in a tree
        HierarchyLevel      AS NodeHID.GetLevel(),  -- Numerical level of where I am in tree
)

I use this for a lot of my hierarchical tables now. You have to be a bit smarter on table population, but reports are a breeze, as is moving up and down the hierarchy, getting ancestors, descendants and so on.

Janine Rawnsley
  • 1,240
  • 2
  • 10
  • 20