How to represent a data tree in SQL?

Question

I'm writing a data tree structure that is combined from a Tree and a TreeNode. Tree will contain the root and the top level actions on the data. I'm using a UI library to present the tree in a windows form where I can bind the tree to the TreeView.

I will need to save this tree and nodes in the DB. What will be the best way to save the tree and to get the following features:

Intuitive implementation.
Easy binding. Will be easy to move from the tree to the DB structure and back (if any)

I had 2 ideas. The first is to serialize the data into a one liner in a table. The second is to save in tables but then, when moving to data entities I will loose the row states on the table on changed nodes.

Any ideas?

See also [What is the most efficient/elegant way to parse a flat table into a tree?](http://stackoverflow.com/questions/192220/what-is-the-most-efficient-elegant-way-to-parse-a-flat-table-into-a-tree/192462#192462) — Bill Karwin, Dec 26 '12 at 18:08
See also [What are the Options for Storing Hierarchical Data in a Relational Database?](http://stackoverflow.com/questions/4048151/what-are-the-options-for-storing-hierarchical-data-in-a-relational-database) — ghord, Apr 11 '15 at 12:20
If you can use PostgreSQL you can check the **ltree** extension: https://www.postgresql.org/docs/current/ltree.html — xonya, Aug 05 '19 at 17:04

score 62 · Answer 1 · edited Apr 09 '14 at 13:27

62

I've bookmarked this slidshare about SQL-Antipatterns, which discusses several alternatives: http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back?src=embed

The recommendation from there is to use a Closure Table (it's explained in the slides).

Here is the summary (slide 77):

                  | Query Child | Query Subtree | Modify Tree | Ref. Integrity
Adjacency List    |    Easy     |     Hard      |    Easy     |      Yes
Path Enumeration  |    Easy     |     Easy      |    Hard     |      No
Nested Sets       |    Hard     |     Easy      |    Hard     |      No
Closure Table     |    Easy     |     Easy      |    Easy     |      Yes

edited Apr 09 '14 at 13:27

Eldritch Conundrum

8,452
6
42
50

answered Nov 28 '10 at 16:43

Björn

1,183
2
13
18

20

That's a very good reference, starting from slide 48 - http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back/48 – mahemoff Jul 17 '14 at 14:35
3

Incredibly informative slides. I highly recommend everyone read through them. – ctrlplusb Dec 12 '19 at 15:58

score 43 · Answer 2 · edited Feb 18 '15 at 21:09

The easiest implementation is adjacency list structure:

id  parent_id  data

However, some databases, particularly MySQL, have some issues in handling this model, because it requires an ability to run recursive queries which MySQL lacks.

Another model is nested sets:

id lft rgt data

where lft and rgt are arbitrary values that define the hierarchy (any child's lft, rgt should be within any parent's lft, rgt)

This does not require recursive queries, but it slower and harder to maintain.

However, in MySQL this can be improved using SPATIAL abitilies.

See these articles in my blog:

for more detailed explanations.

score 13 · Answer 3 · answered Nov 02 '12 at 21:21

I'm suprised that nobody mentioned the materialized path solution, which is probably the fastest way of working with trees in standard SQL.

In this approach, every node in the tree has a column path, where the full path from the root to the node is stored. This involves very simple and fast queries.

Have a look at the example table node:

+---------+-------+
| node_id | path  |
+---------+-------+
| 0       |       |
| 1       | 1     |
| 2       | 2     |
| 3       | 3     |
| 4       | 1.4   |
| 5       | 2.5   |
| 6       | 2.6   |
| 7       | 2.6.7 |
| 8       | 2.6.8 |
| 9       | 2.6.9 |
+---------+-------+

In order to get the children of node x, you can write the following query:

SELECT * FROM node WHERE path LIKE CONCAT((SELECT path FROM node WHERE node_id = x), '.%')

Keep in mind, that the column path should be indexed, in order to perform fast with the LIKE clause.

the earlier link given by Björn for http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back?src=embed covers this and also explains why it prefers to recommend Closure table, it's worth a read. — default_avatar, Nov 18 '13 at 12:41

score 10 · Answer 4 · answered Sep 23 '16 at 14:56

If you are using PostgreSQL you can use ltree, a package in the contrib extension (comes by default) which implements the tree data structure.

From the docs:

CREATE TABLE test (path ltree);
INSERT INTO test VALUES ('Top');
INSERT INTO test VALUES ('Top.Science');
INSERT INTO test VALUES ('Top.Science.Astronomy');
INSERT INTO test VALUES ('Top.Science.Astronomy.Astrophysics');
INSERT INTO test VALUES ('Top.Science.Astronomy.Cosmology');
INSERT INTO test VALUES ('Top.Hobbies');
INSERT INTO test VALUES ('Top.Hobbies.Amateurs_Astronomy');
INSERT INTO test VALUES ('Top.Collections');
INSERT INTO test VALUES ('Top.Collections.Pictures');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Stars');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Galaxies');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Astronauts');
CREATE INDEX path_gist_idx ON test USING GIST (path);
CREATE INDEX path_idx ON test USING BTREE (path);

You can do queries like:

ltreetest=> SELECT path FROM test WHERE path <@ 'Top.Science';
                path
------------------------------------
 Top.Science
 Top.Science.Astronomy
 Top.Science.Astronomy.Astrophysics
 Top.Science.Astronomy.Cosmology
(4 rows)

score 7 · Answer 5 · edited Oct 30 '11 at 14:38

It depends on how you will be querying and updating the data. If you store all the data in one row, it's basically a single unit that you can't query into or partially update without rewriting all the data.

If you want to store each element as a row, you should first read Managing Hierarchical Data in MySQL (MySQL specific, but the advice holds for many other databases too).

If you're only ever accessing an entire tree, the adjacency list model makes it difficult to retrieve all nodes under the root without using a recursive query. If you add an extra column that links back to the head then you can do SELECT * WHERE head_id = @id and get the whole tree in one non-recursive query, but it denormalizes the database.

Some databases have custom extensions that make storing and retrieving heirarchical data easier, for example Oracle has CONNECT BY.

score 5 · Answer 6 · answered Dec 12 '18 at 20:02

As this is the top answer when asking "sql trees" in a google search, I will try to update this from the perspective of today (december 2018).

Most answers imply that using an adjacency list is both simple and slow and therefore recommend other methods.

Since version 8 (published april 2018) MySQL supports recursive common table expressions (CTE). MySQL is a bit late to the show but this opens up a new option.

There is a tutorial here that explains the use of recursive queries to manage an adjacency list.

As the recursion now runs completely within the database engine, it is way much faster than in the past (when it had to run in the script engine).

The blog here gives some measurements (which are both biased and for postgres instead of MySQL) but nevertheless it shows that adjacency lists do not have to be slow.

So my conclusion today is:

The simple adjacency list may be fast enough if the database engine supports recursion.
Do a benchmark with your own data and your own engine.
Do not trust outdated recommendations to point out the "best" method.

Exo Flame · Answer 7 · 2022-05-07T09:38:01.927

PGSQL Tree relations

Hello, I just got a handle on this for a project I'm working on and figured I'd share my write-up Hope this helps. Let's get started with some prereqs

This is essentially the closure table solution mentioned above Using recursive calls. Thanks for those slides they are very useful I wish i saw them before this write up :)

pre-requisites

Recursive Functions

these are functions that call themselves ie

function factorial(n) {
    if (n = 0) return 1; //base case
    return n * factorial(n - 1); // recursive call
}

This is pretty cool luckily pgsql has recursive functions too but it can be a bit much. I prefer functional stuff cte with pgsql

WITH RECURSIVE t(n) AS (
    VALUES (1) -- nonrecusive term 
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100 -- recusive term
    --continues until union adds nothing
)
SELECT sum(n) FROM t;

The general form of a recursive WITH query is always a non-recursive term, then UNION (or UNION ALL), then a recursive term, where only the recursive term can contain a reference to the query's own output. Such a query is executed as follows:

Recursive Query Evaluation

Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.
So long as the working table is not empty, repeat these steps:

a. Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.

b. Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.

to do something like factorial in sql you need to do something more like this so post

ALTER FUNCTION dbo.fnGetFactorial (@num int)
RETURNS INT
AS
BEGIN
    DECLARE @n  int

    IF @num <= 1 SET @n = 1
    ELSE SET @n = @num * dbo.fnGetFactorial(@num - 1)

    RETURN @n
END
GO

Tree data structures (more of a forest :)

wikipedia

The import thing to note is that a tree is a subset of a graph, This can be simply enforced by the relationship each node has only one parent.

Representing the Tree in PGSQL

I think it will be easiest to work it out a little more theoretically before we move on to the sql

The simple way of represent a graph relation without data duplication is by separating the nodes(id, data) from the edges. We can then restrict the edges(parent_id, child_id) table to enforce our constraint. be mandating that parent_id,child_id as well as just child id be unique

create table nodes (
    id uuid default uuid_generate_v4() not null unique ,
    name varchar(255) not null,
    json json default '{}'::json not null,
    remarks varchar(255),
);


create table edges (
    id uuid default uuid_generate_v4() not null,
    parent_id uuid not null,
    child_id uuid not null,
    meta json default '{}'::json,
    constraint group_group_id_key
        primary key (id),
    constraint group_group_unique_combo
        unique (parent_id, child_id),
    constraint group_group_unique_child
        unique (child_id),
    foreign key (parent_id) references nodes
        on update cascade on delete cascade,
    foreign key (child_id) references nodes
        on update cascade on delete cascade
);

Note that theoretical this can all be done with only one table by simply putting the parent_id in the nodes table and then

CREATE VIEW v_edges as (SELECT id as child_id, parent_id FROM nodes)

but for the proposal of flexibility and so that we can incorporate other graph structures to this framework I will use the common many-to-many relationship structure. This will ideally allow this research to be expanded into other graph algorithms.

Let's start out with a sample data structure

INSERT (id, my_data) VALUES ('alpha', 'my big data') INTO nodes
INSERT (id, my_data) VALUES ('bravo', 'my big data') INTO nodes
INSERT (id, my_data) VALUES ('charly', 'my big data') INTO nodes
INSERT (id, my_data) VALUES ('berry', 'my big data') INTO nodes
INSERT (id, my_data) VALUES ('zeta', 'my big data') INTO nodes
INSERT (id, my_data) VALUES ('yank', 'my big data') INTO nodes

INSERT (parent_id, child_id) VALUES ('alpha', 'bravo') INTO edges
INSERT (parent_id, child_id) VALUES ('alpha', 'berry') INTO edges
INSERT (parent_id, child_id) VALUES ('bravo', 'charly') INTO edges
INSERT (parent_id, child_id) VALUES ('yank', 'zeta') INTO edges

-- rank0       Alpha      Yank
-- rank1    Bravo Berry     Zeta
-- rank2  Charly

Note the interesting properties of a tree (number of edges e) =( number of nodes n)-1 each child has exactly one parent.

We can then simplify the equations

let n = node
let p = parent 
let c = child
let ns = nodes = groups
let es = edges = group_group // because this is a relationship of a group entity to another group entity

So now what sort of questions will we ask.

"Given an arbitrary set of groups 's' what is the coverage of the graph assuming nodes inherit their children?"

This is a tricky question, it requires us to traverse the graph and find all children of each node in s

This continues off of this stack overflow post

        -- some DBMS (e.g. Postgres) require the word "recursive"
        -- some others (Oracle, SQL-Server) require omitting the "recursive"
        -- and some (e.g. SQLite) don't bother, i.e. they accept both
-- drop view v_group_descendant;
create view v_group_descendant as
with recursive descendants -- name for accumulating table
  (parent_id, descendant_id, lvl) -- output columns
as
  ( select parent_id, child_id, 1
    from group_group -- starting point, we start with each base group
  union all
    select d.parent_id, s.child_id, d.lvl + 1
    from descendants  d -- get the n-1 th level of descendants/ children
      join group_group  s -- and join it to find the nth level
        on d.descendant_id = s.parent_id -- the trick is that the output of this query becomes the input
        -- Im not sure when it stops but probably when there is no change
  )
select * from descendants;

comment on view v_group_descendant is 'This aggregates the children of each group RECURSIVELY WOO ALL THE WAY DOWN THE TREE :)';

after we have this view we can join with our nodes/groups to get out data back i will not provide these samples for every single step for the most part we will just work with ids.

select d.*, g1.group_name as parent, g2.group_name as decendent --then we join it with groups to add names
from v_group_descendant d, groups g1, groups g2
WHERE g1.id = d.parent_id and g2.id = d.descendant_id
order by parent_id, lvl, descendant_id;

sample output

+------------------------------------+------------------------------------+---+----------+---------+
|parent_id                           |descendant_id                       |lvl|parent    |decendent|
+------------------------------------+------------------------------------+---+----------+---------+
|3ef7050f-2f90-444a-a20d-c5cbac91c978|6c758087-a158-43ff-92d6-9f922699f319|1  |bravo     |charly   |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|3ef7050f-2f90-444a-a20d-c5cbac91c978|1  |alpha     |bravo    |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|7135b0c6-d59c-4c27-9617-ddcf3bc79419|1  |alpha     |berry    |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|6c758087-a158-43ff-92d6-9f922699f319|2  |alpha     |charly   |
|42529e8a-75b0-4242-a51a-ac60a0e48868|44758087-a158-43ff-92d6-9f922699f319|1  |yank      |zeta     |
+------------------------------------+------------------------------------+---+----------+---------+

Note that this is just the minimal node descendant relationship and has actual lost all nodes with 0 children such as charly.

In order to resolve this we need to add all nodes back which don't appear in the descendants list

 create view v_group_descendant_all as (
       select * from  v_group_descendant gd
       UNION ALL
       select  null::uuid as parent_id,id as descendant_id, 0 as lvl from groups g
       where not exists (select * from  v_group_descendant gd where gd.descendant_id = g.id )
);
comment on view v_group_descendant is 'complete list of descendants including rank 0 root nodes descendant - parent relationship is duplicated for all levels / ranks';

preview
+------------------------------------+------------------------------------+---+----------+---------+
|parent_id                           |descendant_id                       |lvl|parent    |decendent|
+------------------------------------+------------------------------------+---+----------+---------+
|3ef7050f-2f90-444a-a20d-c5cbac91c978|6c758087-a158-43ff-92d6-9f922699f319|1  |bravo     |charly   |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|3ef7050f-2f90-444a-a20d-c5cbac91c978|1  |alpha     |bravo    |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|7135b0c6-d59c-4c27-9617-ddcf3bc79419|1  |alpha     |berry    |
|c1529e8a-75b0-4242-a51a-ac60a0e48868|6c758087-a158-43ff-92d6-9f922699f319|2  |alpha     |charly   |
|42529e8a-75b0-4242-a51a-ac60a0e48868|44758087-a158-43ff-92d6-9f922699f319|1  |yank      |zeta     |
|null                                |c1529e8a-75b0-4242-a51a-ac60a0e48868|0  |null      |alpha    |
|null                                |42529e8a-75b0-4242-a51a-ac60a0e48868|0  |null      |yank     |
+------------------------------------+------------------------------------+---+----------+---------+

Lets say for example we are getting our set s of groups bases on a users(id , data) table with a user_group(user_id, group_id) relation

We can then join this to another table removing duplicates because our set s of user_group relations may cause duplicates if a users is say assigned to both alpha assigned charly

+------+--------+
| user | group  |
+------+--------+
| jane | alpha  |
| jane | charly |
| kier | yank   |   
| kier | bravo  |
+------+--------+

--drop view v_user_group_recursive;
CREATE VIEW v_user_group_recursive AS (
SELECT DISTINCT dd.descendant_id AS group_id, ug.user_id 
    FROM v_group_descendant_all dd , user_group ug
    WHERE (ug.group_id = dd.descendant_id 
        OR ug.group_id = dd.parent_id)  -- should gic
);
SELECT * FROM v_user_group_recursive;

+------+--------+
| user | group  |
+------+--------+
| jane | alpha  |
| jane | bravo  |
| jane | berry  |
| jane | charly |
-- | jane | charly | Removed by DISTINCT
| kier | yank   |   
| kier | zeta   |   
| kier | bravo  |
| kier | charly |
+------+--------+

If we want we can now group by node and join we can do somthing k like the fallowing

CREATE VIEW v_user_groups_recursive AS (
    SELECT user_id, json_agg(json_build_object('id', id,'parent_id',parent_id, 'group_name', group_name, 'org_id', org_id, 'json', json, 'remarks', remarks)) as groups
    FROM v_user_group_recursive ug, v_groups_parent g
    WHERE ug.group_id = g.id GROUP BY user_id
);
comment on view v_user_group_recursive is 'This aggregates the groups for each user recursively ';

+------+-------------------------------+
| user | groups                        |
+------+-------------------------------+
| jane | [alpha, bravo, berry, charly] |
| kier | [yank, zeta, bravo, charly]   |   
+------+-------------------------------+

This is awesome we have answered the question. We now can simply ask which groups this use inherits

SELECT * from v_user_groups_recursive where user_id = 'kier

Displaying our hard work in the front end

And further we could use somthing like jstree.com to display our structure

  async function getProjectTree(user_id) {
        let res = await table.query(format('SELECT * from v_user_groups_recursive ug WHERE ug.user_id = %L', user_id));
        if (res.success) {
            let rows = res.data[0].groups.map(r => {

                return {
                    id: r.id, // required
                    parent: r.parent_id==null?'#':r.parent_id,// required
                    text: r.group_name,// node text
                    icon: 'P', // string for custom
                    state: {
                        opened: true,  // is the node open
                        disabled: false,  // is the node disabled
                        selected: false,  // is the node selected
                    },
                    li_attr: {},  // attributes for the generated LI node
                    a_attr: {}  // attributes for the generated A node
                }
            })
           
            return {success: true, data: rows, msg: 'Got all projects'}
        } else return res;
    }

<div id="v_project_tree" class="row col-10 mx-auto" style="height: 25vh"></div>
<script>
   function buildTree() {
      bs.sendJson('get', "/api/projects/getProjectTree").then(res => {
         bs.resNotify(res);
         if (!res.success) {
            //:(
            console.error(':(');
            return
         }
         console.log(res.data);
         $('#v_project_tree').jstree({
            'core': {
               'data': res.data
            }
         });
      })
   }
   window.addEventListener('load', buildTree);
</script>

jstree preview

blog

score 1 · Answer 8 · answered Apr 06 '11 at 16:29

The best way, I think indeed is to give each node an id and a parent_id, where the parent id is the id of the parent node. This has a couple of benefits

When you want to update a node, you only have to rewrite the data of that node.
When you want to query only a certain node, you can get exactly the information you want, thus having less overhead on the database connection
A lot of programming languages have functionality to transform mysql data into XML or json, which will make it easier to open up your application using an api.

score 0 · Answer 9 · answered Feb 01 '10 at 10:21

Something like table "nodes" where each node row contains parent id (in addition to the ordinary node data). For root, the parent is NULL.

Of course, this makes finding children a bit more time consuming, but this way the actual database will be quite simple.