11

I'm using PostgreSQL's Ltree module for storing hierarchical data. I'm looking to retrieve the full hierarchy sorted by a particular column.

Consider the following table:

  votes | path  | ...
 -------+-------+-----
      1 | 1     | ...
      2 | 1.1   | ...
      4 | 1.2   | ...
      1 | 1.2.1 | ...
      3 | 2     | ...
      1 | 2.1   | ...
      2 | 2.1.1 | ...
      4 | 2.1.2 | ...
    ... | ...   | ...

In my current implementation, I'd query the database with SELECT * FROM comments ORDER BY path, which would return the whole tree:

Node 1
-- Node 1.1
-- Node 1.2
---- Node 1.2.1
Node 2
-- Node 2.1
---- Node 2.1.1
---- Node 2.1.2

However, I want to sort by votes (not by id, which is what sorting by path amounts to). Each depth level needs to be independently sorted, with the correct tree structure kept intact. Something that would return the following:

Node 2
-- Node 2.1
---- Node 2.1.2
---- Node 2.1.1
Node 1
-- Node 1.2
---- Node 1.2.1
-- Node 1.1

Postgres' WITH RECURSIVE might be appropriate, but I'm not sure. Any ideas?

Cœur
  • 37,241
  • 25
  • 195
  • 267
David Chouinard
  • 6,466
  • 8
  • 43
  • 61
  • Just curious what if path is getting bigger (let's say 12.32.11)... sorting by path is gonna fail... I'm pretty sure you faced that problem. Did you have come up with solution? – Alexander B. Jul 07 '20 at 00:09

2 Answers2

15

You were on the right track with WITH RECURSIVE.

Solution with recursive CTE

WITH RECURSIVE t AS (
    SELECT t.votes
         , t.path
         , 1::int AS lvl
         , to_char(t2.votes, 'FM0000000')  AS sort
    FROM   tbl t
    JOIN   tbl t2 ON t2.path = subltree(t.path, 0, 1)

    UNION ALL
    SELECT t.votes
         , t.path
         , t.lvl + 1
         , t.sort || to_char(t2.votes, 'FM0000000')
    FROM   t
    JOIN   tbl t2 ON t2.path = subltree(t.path, 0, t.lvl + 1)
    WHERE  nlevel(t.path) > t.lvl
    )
SELECT votes, path, max(sort) AS sort
FROM   t
GROUP  BY 1, 2
ORDER  BY max(sort), path;

Major points

  • The crucial part is to replace every level of the path with the value of votes. Thereby we assemble one column we can ORDER BY at the end. This is necessary, because the path has an unknown depth and we cannot order by an unknown number of expressions in static SQL.

  • In order to get a stable sort, I convert votes to a string with leading zeroes using to_char(). I use seven digits in the demo, which works for vote values below 10.000.000. Adjust according to your maximum vote count.

  • In the final SELECT I exclude all intermediary states to eliminate duplicates. Only the last step with max(sort) remains.

  • This works in standard SQL with a recursive CTE, but is not very efficient for large trees. A plpgsql function that recursively updates the sort path in a temporary table without creating temporary dupes might perform better.

  • Only works with the additional module ltree installed, which provides the functions subltree() and nlevel(), as well as the ltree data type.

My test setup, for review convenience:

CREATE TEMP TABLE tbl(votes int, path ltree);
INSERT INTO tbl VALUES
  (1, '1')
, (2, '1.1')
, (4, '1.2')
, (1, '1.2.1')
, (3, '2')
, (1, '2.1')
, (2, '2.1.1')
, (4, '2.1.2')
, (1, '2.1.3')
, (2, '3')
, (17, '3.3')
, (99, '3.2')
, (10, '3.1.1')
, (2345, '3.1.2')
, (1, '3.1.3')
;

PL/pgSQL table function doing the same

Should be faster with huge trees.

CREATE OR REPLACE FUNCTION f_sorted_ltree()
  RETURNS TABLE(votes int, path ltree)
  LANGUAGE plpgsql VOLATILE AS
$func$
DECLARE
   lvl integer := 0;
BEGIN
   CREATE TEMP TABLE t ON COMMIT DROP AS
   SELECT tbl.votes
        , tbl.path
        , ''::text AS sort
        , nlevel(tbl.path) AS depth
   FROM   tbl;

   -- CREATE INDEX t_path_idx ON t (path);   -- beneficial for huge trees
   -- CREATE INDEX t_path_idx ON t (depth);

   LOOP
      lvl := lvl + 1;

      UPDATE t SET sort = t.sort || to_char(v.votes, 'FM0000000')
      FROM  (
         SELECT t2.votes, t2.path
         FROM   t t2
         WHERE  t2.depth = lvl
         ) v
      WHERE  v.path = subltree(t.path, 0 ,lvl);

      EXIT WHEN NOT FOUND;
   END LOOP;

   -- Return sorted rows
   RETURN QUERY
   SELECT t.votes, t.path
   FROM   t
   ORDER  BY t.sort;
END
$func$;

Call:

SELECT * FROM f_sorted_ltree();

Read in the manual about setting temp_buffers.

I would be interested which performs faster with your real life data.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thanks for the *amazing* reply. This is a quite clever solution. I'm setting up a clone of the production database to test various approaches and I'll let you know. – David Chouinard Jan 10 '12 at 14:39
  • 1
    Your second solution is about twice as fast as the first on my data. My trees aren't very deep (max 5-6 levels) and have <100 elements. I'm not seeing much of a differences in performance between both approaches on my larger trees (still ~2x faster). However, I'm attempting to sort `votes` in descending order, but ordering `sort` by `DESC` messes up the tree. Any thoughts there? – David Chouinard Jan 14 '12 at 17:43
  • When I run the function on the test setup you provided, I get the error: `structure of query does not match function result type` `Detail: Number of returned columns (4) does not match expected column count (2).` – Un3qual Jul 04 '17 at 05:23
  • @Un3qual: Seems impossible with `RETURN QUERY SELECT t.votes, t.path` - obviously just *2* columns. – Erwin Brandstetter Jul 04 '17 at 05:54
  • @ErwinBrandstetter I thought so too. The only thing that I changed was that I removed the `x.` from the name of the function. The rest of the error is `CONTEXT: PL/pgSQL function f_sorted_ltree() line 30 at RETURN QUERY` – Un3qual Jul 04 '17 at 06:12
  • 1
    @Un3qual: The `x.` was a debug leftover. More importantly, I had a typo in the function name: `fsorted_ltree()` instead of `f_sorted_ltree`. You may have two functions with slightly different names - and be looking at the wrong one to debug. Check: `SELECT * FROM pg_proc WHERE proname ~ 'sorted_ltree'`. – Erwin Brandstetter Jul 04 '17 at 14:48
0
create table comments (
  id serial,
  parent_id int,
  msg text,
  primary key (id)
);

insert into comments (id, parent_id, msg) values (1, null, 'msg 1');
insert into comments (id, parent_id, msg) values (2, null, 'msg 2');
insert into comments (id, parent_id, msg) values (3, 1, 'msg 1 / ans 1');
insert into comments (id, parent_id, msg) values (4, null, 'msg 3');
insert into comments (id, parent_id, msg) values (5, 2, 'msg 2 / ans 1');
insert into comments (id, parent_id, msg) values (6, 2, 'msg 2 / ans 2');
insert into comments (id, parent_id, msg) values (7, 2, 'msg 2 / ans 3');

desc

WITH RECURSIVE q AS
(
  SELECT  id, msg, 1 as level, ARRAY[id] as path
  FROM  comments c
  WHERE parent_id is null
  UNION ALL
  SELECT  sub.id, sub.msg, level + 1, path || sub.id
  FROM  q
    JOIN  comments sub
      ON  sub.parent_id = q.id
)
SELECT  id, msg, level
FROM    q
order by path || array_fill(100500, ARRAY[8 - level]) desc;

results in

4,"msg 3",1
2,"msg 2",1
7,"msg 2 / ans 3",2
6,"msg 2 / ans 2",2
5,"msg 2 / ans 1",2
1,"msg 1",1
3,"msg 1 / ans 1",2

asc

WITH RECURSIVE q AS
(
  SELECT  id, msg, 1 as level, ARRAY[id] as path
  FROM  comments c
  WHERE parent_id is null
  UNION ALL
  SELECT  sub.id, sub.msg, level + 1, path || sub.id
  FROM  q
    JOIN  comments sub
      ON  sub.parent_id = q.id
)
SELECT  id, msg, level
FROM    q
--order by path || array_fill(100500, ARRAY[8 - level]) desc;
order by path;

results in

1,"msg 1",1
3,"msg 1 / ans 1",2
2,"msg 2",1
5,"msg 2 / ans 1",2
6,"msg 2 / ans 2",2
7,"msg 2 / ans 3",2
4,"msg 3",1
Penkov Vladimir
  • 921
  • 5
  • 10