3

I have a table with the following columns: group_id, parent_id, name

In this table parent_id is the group_id of another record. There is a 1 to N relationship of parents to children. This forms a hierarchy with only one top level group having NULL for a parent_id. There could be an arbitrary amount of depth, but in practice my hierarchy is never more than 20 levels deep.

I would like to retrieve every ancestor (a parent's parent and so on) of a group with a given group_id. I am concerned about the particular way this is returned.

I am using MS SQL 2005, but I am also interested in solutions using other RDBMSs.

I have found some similar questions, but they all seem to break down to recursion, looping or nested sets. I cannot use nested sets, because I cannot change the data structure. I would like to avoid recursion or looping where possible, or at least understand why it is not possible.

Here are some question I found while researching this:

How to select parent ids

Sql recursion without recursion

Community
  • 1
  • 1
Sqeaky
  • 1,876
  • 3
  • 21
  • 40
  • Your only goal is to get the uppermost top-level parent_id for every id? – wildplasser Apr 25 '12 at 21:49
  • The first link you posted will solve your problem using recursive common table expression (`with ...`) this will also work on most modern DBMS (with slight syntax changes) –  Apr 25 '12 at 21:56
  • @a_horse_with_no_name: that was where I was aiming at. Seems trivial. – wildplasser Apr 25 '12 at 21:57
  • @wildplasser Let me clarify "I would like to retrieve every parent" as in each parent of the group, including each in the middle until you get to the top. You can get to the top-most rather trivially with: Select * from table where parent_id is null; – Sqeaky Apr 25 '12 at 22:06
  • You want to avoid recursion in solving a problem which is inherently recursive? Seems very logical. – wildplasser Apr 25 '12 at 22:09
  • 1
    You should have said you also want to avoid *implicit* recursion (because the SQL itself is not recursive, only it's execution). But why do you want to avoid that? A CTE is probably the most efficient solution to this problem. –  Apr 25 '12 at 22:11
  • @a_horse_with_no_name Yeah that first link does solve it with recursion, I want to avoid that with this question. (Edit: I rephrased to make it less rude) – Sqeaky Apr 25 '12 at 22:15
  • But **why** do you want to avoid that? –  Apr 25 '12 at 22:16
  • @wildplasser SQL is a system for perform large complex operations on structured data-sets that took previous languages like C and COBOL many more lines to complete. I see no reason why this is fundamentally different than taking an iterative task and making it a single line. – Sqeaky Apr 25 '12 at 22:18
  • @a_horse_with_no_name Partly for academic reasons, Partly to clean up a messy codebase I have, and partly to prompt a best practices discussion. (Sorry for my rude phrasing earlier, I have corrected it.) – Sqeaky Apr 25 '12 at 22:20
  • Well, best practise (IMHO) is definitely the recursive CTE –  Apr 25 '12 at 22:23
  • Best practice is definitely a recursive CTE. Every procedural code (or even worse: explicit cursors) is subobtimal compared to something that is executed inside the engine. Both in execution speed and in number of lines. – wildplasser Apr 25 '12 at 22:47

3 Answers3

1

The operation is inherently looped. Because each node does not have any finite relation to their root, you must traverse in order to discover it.

If, for example, you knew that there was a maximum depth of N then you could create N LEFT OUTER JOINs in a single statement and display the last non-null parent ID returned this way.

The looping requirement is that you simply don't know what N is, and you cannot ask a declarative language like SQL to "figure it out"

Even if you can accomplish it with some built-in method, it will still be a loop or recursion, just obfuscated from you.

Matthew
  • 10,244
  • 5
  • 49
  • 104
  • In the first answer in the second link I posted, there is an oracle centric answer that does it in one clean query (Though I do not yet understand it). Even if the DB must perform some loop, wouldn't it still perform better in the native code that the DBMS was created in than re-implementing that loop in a SQL? How much caching and execution plan preparation do modern DMBSs do? – Sqeaky Apr 25 '12 at 22:14
1

If you know exactly how deep your data structure goes, you can just write the code out by hand:

DECLARE
    @parentId1 int
   ,@parentId2 int
   ...
   ,@parentId19 int
   ,@parentId20 int

SELECT
    @parentId1 = parent_id
FROM
    myTable
WHERE
    group_id = <someid>

SELECT
    @parentId2 = parent_id
FROM
    myTable
WHERE
    group_id = @parentId1

And so on. However, this will give you a whole bunch of extra code and won't perform any better than a loop, and it's horribly brittle. Adding a new level to the tree requires you to amend your code, which should be an instant code smell.

Think about it in terms of any other language. You have to perform task X a total of N times, where N is variable. How are you going to go about writing this? You'd use a loop. Now suppose that your data structure is a tree (which is what you've got here). How would you write this? You'd probably use recursion, unless you flatten the recursion into a loop.

The only caveat specific to MSSQL is the fact that, by default, the recursion stack is limited to a depth of 16. You're much better off using loops than recursion in MSSQL.

I'd typically do something like this:

-- Temp table will hold the results starting from the ID of the source item
-- through all its ancestors in ascending order
DECLARE @table TABLE (
    sequence int IDENTITY(1, 1)
   ,group_id int
)

DECLARE @groupId int

SELECT @groupId = <someid>

-- Loop backwards through the group's hierarchy inserting all parent IDs
-- into the temporary table
WHILE @groupId IS NOT NULL
BEGIN
    INSERT INTO @table (
        group_id
    )

    VALUES (
        @groupId
    )

    -- Get the ID of the group's parent ready to loop again
    SELECT @groupId = parent_id
    FROM mutable
    WHERE group_id = @groupId
END

-- Print the results
SELECT group_id
FROM @table

There are probably better ways, but that'll give you all IDs in a form you can manipulate easily, plus they'll be in the correct order.

Ant
  • 4,890
  • 1
  • 31
  • 42
  • 1
    This is ineffective because you do not know the nesting depth. Presume the nesting depth is one million, then you would have the code copy-pasta one-million times to function properly. – Matthew Apr 25 '12 at 21:49
  • 1
    @Matthew: That's precisely the point of the answer. You need to read the rest of the text after the code... – Ant Apr 25 '12 at 21:50
  • 1
    but the OP explicitly stated that he *does not* know the nesting depth. – Matthew Apr 25 '12 at 21:59
  • 1
    @MatthewPK: I know, but the *only* possible way to achieve a tree walk without using a loop or recursion is to write out the code by hand. The answer illustrates this possibility and then points out the problems with it, which are: a) you need to know the maximum depth (you could cope with shorter depths by performing NULL checks before each subsequent SELECT) and b) it's terrible code. In the last part of the question the OP asks *why* he must use looping or recursion, which I'm answering here. – Ant Apr 25 '12 at 22:04
  • In the second link I posted, there is an Oracle Centric that does it with a single select, I was hoping for a crafty solution that worked elsewhere. Also considering that fundamentally what SQL does is to iterate over a dataset and perform multiple comparisons and structure the data in a sophisticated way, makes me suspect most queries actually call a few loops in the RDBMS. – Sqeaky Apr 25 '12 at 22:24
  • Yes, the `connect by ... prior` was the predecessor (pun intended) of the recursive CTE (which was standardised). – wildplasser Apr 25 '12 at 22:49
1

You can create a temporary table similar to yours and populate it like this:

INSERT INTO #T(group_id, parent_id) SELECT group_id, parent_id FROM Your_Table

Now perform the exact same SQL five times:

INSERT INTO #T(group_id, parent_id) SELECT T2.group_id, T1.parent_id FROM #T T1 JOIN #T T2 ON T2.parent_id = T1.group_id LEFT JOIN #T C ON T2.group_id = C.group_id AND T1.parent_id = C.parent_id  AND C.group_id IS NULL
INSERT INTO #T(group_id, parent_id) SELECT T2.group_id, T1.parent_id FROM #T T1 JOIN #T T2 ON T2.parent_id = T1.group_id LEFT JOIN #T C ON T2.group_id = C.group_id AND T1.parent_id = C.parent_id  AND C.group_id IS NULL
INSERT INTO #T(group_id, parent_id) SELECT T2.group_id, T1.parent_id FROM #T T1 JOIN #T T2 ON T2.parent_id = T1.group_id LEFT JOIN #T C ON T2.group_id = C.group_id AND T1.parent_id = C.parent_id  AND C.group_id IS NULL
INSERT INTO #T(group_id, parent_id) SELECT T2.group_id, T1.parent_id FROM #T T1 JOIN #T T2 ON T2.parent_id = T1.group_id LEFT JOIN #T C ON T2.group_id = C.group_id AND T1.parent_id = C.parent_id  AND C.group_id IS NULL
INSERT INTO #T(group_id, parent_id) SELECT T2.group_id, T1.parent_id FROM #T T1 JOIN #T T2 ON T2.parent_id = T1.group_id LEFT JOIN #T C ON T2.group_id = C.group_id AND T1.parent_id = C.parent_id  AND C.group_id IS NULL

After this, your table now tracks ancestors rather than parents, up to 32 levels of distance. (2^5 = 32 and 32 > 20).

This is one efficient way to compute "transitive closure", although if you add looping instead of just repeating the same INSERT five times, you will not need your assumption about 20 levels any more. You should just stop when the INSERT inserts zero new rows. This kind of looping will help rather than hurt performance as well as the number of iterations will be very small.

Jirka Hanika
  • 13,301
  • 3
  • 46
  • 75
  • This is going to take a little while for me to digest, +1 for the novelty. – Sqeaky Apr 25 '12 at 22:01
  • 1
    @Squeaky - Here's how to digest it: The first query makes grand-parents into regular parents, but the parents also stay in #T. That's distance 1 or 2. The next iteration similarly adds ancestors with distance at most 2+2 = 4. The next one adds ancestors with distance at most 4+4 = 8. And so on. This is done by the (inner) JOIN. To explain the LEFT JOIN: never add ancestor relations that are already in #T, to prevent it from growing with duplicate rows. – Jirka Hanika Apr 25 '12 at 22:07
  • How much of this is likely to be cached to prevent this large seeming amount of work from being duplicated each call? – Sqeaky Apr 25 '12 at 22:32
  • @Squeaky - If it fits into RAM, then everything. Also, the temporary table should be created with a primary key consisting of group_id and parent_id for optimum speed of access. – Jirka Hanika Apr 25 '12 at 22:37