Recursive CTE to walk up a network, but choosing path at junctions based on an attribute

Question

I am trying to trace a path in a river network graph by "walking" up it until a terminal node is reached. The problem is that when walking upstream, we encounter tributaries, or junctions, where our path can go multiple directions. I would like to specify the direction of the walk using an attribute of the links. I am trying to construct a CTE with postgres to achieve this, but cannot quite get there.

Here is a simple diagram of the network; reaches in red and nodes in blue:

Each reach has an area attribute. This graph can be constructed via

CREATE TEMP TABLE tree (
  id_reach integer PRIMARY KEY,
  from_node integer UNIQUE NOT NULL,
  to_node integer,
  area float
);

INSERT INTO tree(id_reach, from_node, to_node, area) VALUES
 (0, 1, 0, 40),
 (1, 3, 1, 29),
 (2, 2, 1, 5),
 (3, 4, 3, 7),
 (4, 5, 3, 6);

What I want to do

I would like to specify a node within this graph, then traverse upstream following the path of the reaches with the largest area. So for the example shown, the outputs would look as

 starting node |  path  |
---------------+---------
       0        {0, 1, 3}
       1        {1, 3}
       3        {3}

I cannot figure out how to incorporate the area information into the CTE. So far, the best I have come up with returns all the reaches upstream of the starting_node (which is to_node = 1 in the following):

WITH RECURSIVE us_path AS (
    SELECT id_reach,
        from_node,
        to_node
    FROM tree
    WHERE to_node = 1
    UNION ALL
    SELECT tr.id_reach,
        tr.from_node,
        tr.to_node
    FROM tree as tr
        JOIN us_path AS usp ON tr.to_node = usp.from_node
)
SELECT id_reach
FROM us_path;

This of course returns

which are all the id_reachs upstream.

How can I modify the CTE to only return a path that follows the id_reachs with the largest area?

What I have tried

I tried post-processing the results of the CTE above, but I could find no way to filter the results to return this path. I have tried inserting WHERE and HAVING statements into the CTE to try to select the reach with the largest area, but could not get them to work either.

Update

Some great answers have been given below, all of which have taught me something new. The accepted answer is the simplest and fastest for my use case, although the other answers provide mechanisms for generalizing the problem a bit.

Yes, very rarely, but in those cases choosing any of the ties would be acceptable. — Jon, Sep 02 '21 at 01:23

eshirvana · Answer 1 · 2021-09-07T19:59:17.437

here is how you can do it:

WITH RECURSIVE us_path AS (
    select *
    from (
        SELECT * , row_number() over (order by area desc) maxarea       
        FROM tree
        WHERE to_node = 0
    ) t where maxarea = 1 
    UNION ALL
    select *
    from (select tr.*,row_number() over (order by tr.area desc) maxarea
    FROM tree as tr
    join us_path AS usp on tr.to_node = usp.from_node
    ) t where maxarea = 1     
)
SELECT id_reach, area
FROM us_path;

or if you need it for all the posibble starting nodes:

WITH RECURSIVE us_path AS (
    select *
    from (
        SELECT to_node, from_node, id_reach,area , row_number() over (partition by to_node order by area desc) maxarea       
        FROM tree
    ) t where maxarea = 1 
    UNION ALL
    select * 
    from (select usp.to_node, tr.from_node,tr.id_reach,tr.area,row_number() over (partition by usp.to_node order by tr.area desc) maxarea
    FROM tree as tr
    join us_path AS usp 
        ON tr.to_node = usp.from_node
    ) t where maxarea = 1 
)
SELECT to_node, array_agg(id_reach), sum(area) total_area
FROM us_path
group by to_node
order by to_node

db<>fiddle here

Erwin Brandstetter · Answer 2 · 2021-09-10T14:52:58.840

There are many ways. Here are two elegant and fast ones:

1. Get next node with biggest `area` in a `LATERAL` subquery

WITH RECURSIVE us_path AS (
   -- start with biggest area for each node
   (  -- parentheses required
   SELECT DISTINCT ON (to_node)
          to_node AS starting_node, from_node, ARRAY[to_node, from_node] AS path
   FROM   tree
   ORDER  BY to_node, area DESC NULLS LAST
   )
   UNION ALL
   SELECT tr.*
   FROM   us_path usp
   CROSS  JOIN LATERAL (
      SELECT usp.starting_node, tr.from_node, usp.path || tr.from_node
      FROM   tree tr
      WHERE  tr.to_node = usp.from_node
      ORDER  BY tr.area DESC NULLS LAST
      LIMIT  1  -- chose biggest area at each junction
      ) tr
   )
SELECT DISTINCT ON (starting_node)
       starting_node, path
FROM   us_path
ORDER  BY starting_node, cardinality(path) DESC;

db<>fiddle here

Result for the given example:

 starting_node |   path    
---------------+-----------
             0 | {0,1,3,4}
             1 | {1,3,4}
             3 | {3,4}
(3 rows)

At each junction, an arbitrary row is picked from a set of peers that tie for the biggest area - as per your comment:

but in those cases choosing any of the ties would be acceptable.

To make the pick deterministic, add more expressions to each ORDER BY clause.

Since area can be NULL (not defined NOT NULL), add NULLS LAST in descending order, otherwise NULL sorts on top. See:

In the initial (non-recursive) term, from each set of edges with the same to_node, I only pick the one with the greatest area. This way, leave nodes with inferior area are never reached. Your desired result seems to suggest as much.

(
SELECT DISTINCT ON (to_node)
       to_node AS starting_node, from_node, ARRAY[to_node, from_node] AS path
FROM   tree
ORDER  BY to_node, area DESC NULLS LAST
)

Enclose in parentheses, because:

Combine multiple SELECT statements

In the outer SELECT I only pick the "last" row for each starting node, identified by the length of the array (cardinality(path))

WITH RECURSIVE us_path AS (
   -- start with biggest area for each node
   (  -- parentheses required
   SELECT DISTINCT ON (to_node)
          to_node AS starting_node, from_node, 1 AS step
   FROM   tree
   ORDER  BY to_node, area DESC NULLS LAST  -- start with biggest area for each node
   )
   UNION ALL
   SELECT usp.starting_node
        ,(SELECT tr.from_node
          FROM   tree tr
          WHERE  tr.to_node = usp.from_node
          ORDER  BY tr.area DESC NULLS LAST
          LIMIT  1)  -- chose biggest area at each junction
        , step + 1
   FROM   us_path usp      
   WHERE  usp.from_node IS NOT NULL
   )
SELECT starting_node, starting_node || array_agg(from_node)
FROM  (
   SELECT *
   FROM   us_path
   WHERE  from_node IS NOT NULL
   ORDER  BY starting_node, step
   ) sub
GROUP  BY 1;

db<>fiddle here

Same result.

This time we need to eliminate appended NULL values explicitly. Different from the CROSS JOIN above, a correlated subquery does not eliminate those.

I am not selecting this as the accepted answer as another answer provides the solution I asked for using minimal code, but I am awarding the bounty here as the comprehensiveness of this answer is helping me expand my understanding of how to code with postgresql. — Jon, Sep 10 '21 at 21:20

score 1 · Answer 3 · answered Sep 07 '21 at 15:27

This should work


with recursive pt as (
select distinct to_node as start_node,
                null::integer as to_node,
                first_value(from_node) over (partition by to_node order by area desc) as from_node,
                first_value(id_reach) over (partition by to_node order by area desc) as longest_path
    from tree
union all
select distinct pt.start_node as start_node,
                t1.to_node as to_node,
                first_value(t1.from_node) over (partition by t1.to_node order by area desc) as from_node,
                first_value(id_reach) over (partition by t1.to_node order by area desc) as longest_path
    from tree t1
    join pt on pt.from_node = t1.to_node
)
select pt.start_node, array_agg(longest_path) from pt group by pt.start_node

Gives the result:

start_node  path
0           {0,1,3}
1           {1,3}
3           {3}

The first query gives the starting points, using null as to_node in the starting_point guarantee that the row won't be chosen again.

I also make sure that each row contains the starting point, which makes it easy to group all the reaches in the final stage.

If you're not familiar with the first_value function, it just picks the first value found according to the over clause (the reach with the largest area within each to_node)

score 1 · Accepted Answer · answered Sep 09 '21 at 23:58

An answer by eshirvana uses row_number to pick one row per group.

Here is a somewhat shorter and maybe cleaner variant. In recursive part we need just one row, so LIMIT seems like a natural choice. Quite likely it will be a bit faster as well.

WITH RECURSIVE us_path AS (

    SELECT *
    FROM tree
    WHERE to_node = 0
        
    UNION ALL
    
    select * 
    from (
        select tr.*
        FROM tree as tr
        join us_path AS usp 
            ON tr.to_node = usp.from_node
        ORDER BY tr.area desc
        LIMIT 1
    ) t
    
)
SELECT id_reach, area
FROM us_path;

Recursive CTE to walk up a network, but choosing path at junctions based on an attribute

4 Answers4

1. Get next node with biggest `area` in a `LATERAL` subquery

2. Get next node in correlated subquery, aggregate in outer `SELECT`

Recursive CTE to walk up a network, but choosing path at junctions based on an attribute

4 Answers4

1. Get next node with biggest area in a LATERAL subquery

2. Get next node in correlated subquery, aggregate in outer SELECT

1. Get next node with biggest `area` in a `LATERAL` subquery

2. Get next node in correlated subquery, aggregate in outer `SELECT`