1

I've got a problem which I can solve for small data sets but fails on large ones with (possibly) unclean data.

The database is an implementation of an acyclic (hopefully) graph in PostgreSQL. With three tables

vertex_elements: id
edges: id, parent_id, child_id
element_associations: id, user_id, object_id (both are vertex elements, but it unconnected graphs)

I have a set of user_ids from which I derive element_associations and a starting vertex_element in the graph, and I want to find all child nodes are accessible from an element_association with one of the user_ids. A node is considered to be accessible if it, or one of its ancestors is one of the candidate object_ids of an element_association.

The graph is relatively triangular in shape (few root nodes with many leaf nodes), and from a starting vertex element, my strategy is the following:

  1. Check the current vertext_element against the list of candidate element_associations; if it's good, all descendants are accessible, otherwise go to...
  2. Check if the ancestors of the current vertex_element are in the list of candidate element_associations. Similarly to (1), if it's a hit, then all ancestors are accessible, otherwise go to...
  3. Traverse to each child vertex_element (breadth first search) and follow steps 1 & 2.

The issue arises when I want to avoid double-checking the same ancestor vertex_elements. The main query is a traversal downwards, checking each descendant's accessibility with a set of candidate element_associations

        WITH RECURSIVE edges_recursive(child_id, parent_id, matching_element_association_id) AS (
          (
            SELECT e1.child_id, e1.parent_id, ea.id
            FROM edges e1
            LEFT OUTER JOIN element_associations ea ON e1.child_id = ea.object_id
              AND ea.id IN (?)
            WHERE parent_id = ?
          )
          UNION
          (
            SELECT e2.child_id, e2.parent_id, ea.id
            FROM edges e2
            INNER JOIN assignments_recursive
            ON edges_recursive.child_id = e2.parent_id
            LEFT OUTER JOIN element_associations ea 
            ON edges_recursive.child_id = ea.object_id
              AND ea.id IN (?)
            WHERE edges_recursive.matching_element_association_id IS NULL
          )
        )

        SELECT edges_recursive.child_id
        FROM edges_recursive
        WHERE edges_recursive.matching_element_association_id IS NOT NULL

However, there is an addition recursive subquery which checks each vertex_element's within the LEFT OUTER JOIN element_associations which looks like

ea.id IN (
  WITH RECURSIVE parent_edges_recursive(child_id, parent_id, matching_element_association_id) AS (
    (
      SELECT edges.child_id, edges.parent_id, ea.id
      FROM edges
      LEFT OUTER JOIN element_associations ea 
      ON ea.id IN (?) AND edges.parent_id = ea.object_id
      WHERE edges.child_id = e1.parent_id AND edges.parent_id != e1.parent_id
    )
    UNION
    (
      SELECT edges.child_id, edges.parent_id. ea.id
      FROM edges
      JOIN parent_edges_recursive 
      ON parent_edges_recursive.parent_id = edges.child_id
      LEFT OUTER JOIN element_associations ea
      ON ea.id IN (?) AND edges.parent_id = ea.object_id
      WHERE parent_edges_recursive.matching_element_association_id IS NULL
    )
    SELECT parent_edges_recursive.matching_element_association_id
    FROM parent_edges_recursive
    WHERE parent_edges_recursive.matching_element_association_id IS NOT NULL
    LIMIT 1
 )
)

The issue with this is that there is that the subqueries tend to avoid traversing the same parent vertex twice; however, there is no guarantee that as we traverse down the graph through the descendants that we won't retread previously evaluated ancestors. For small data sets, this is fine and the performance is okay; however, it's ridiculously unscalable, and is extremely unresilient to cycles.

What I need to do is preserve the information about what parent vertex_elements I have already traversed between subqueries so that I avoid retreading steps; however, I am stuck on how to do this in a singular query.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
fpcyan
  • 11
  • 1
  • 3

1 Answers1

2

What I need to do is preserve the information about what parent vertex_elements I have already traversed between subqueries so that I avoid retreading steps;

Without studying your queries in detail: you can do that by collecting IDs in an array. Code examples:

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • You are absolutely correct that within a single ancestor subquery I can collect the IDs in an array; however I don't think I can pass the IDs collected in the parent traversal between subqueries, so I could possibly retread steps between. You've neatly identified the crux of the issue -- I want to share that data between each execution of the subquery to avoid repetitions. – fpcyan Feb 09 '18 at 17:55