I've got a problem which I can solve for small data sets but fails on large ones with (possibly) unclean data.
The database is an implementation of an acyclic (hopefully) graph in PostgreSQL. With three tables
vertex_elements: id
edges: id, parent_id, child_id
element_associations: id, user_id, object_id (both are vertex elements, but it unconnected graphs)
I have a set of user_ids from which I derive element_associations and a starting vertex_element in the graph, and I want to find all child nodes are accessible from an element_association with one of the user_ids. A node is considered to be accessible if it, or one of its ancestors is one of the candidate object_ids of an element_association.
The graph is relatively triangular in shape (few root nodes with many leaf nodes), and from a starting vertex element, my strategy is the following:
- Check the current vertext_element against the list of candidate element_associations; if it's good, all descendants are accessible, otherwise go to...
- Check if the ancestors of the current vertex_element are in the list of candidate element_associations. Similarly to (1), if it's a hit, then all ancestors are accessible, otherwise go to...
- Traverse to each child vertex_element (breadth first search) and follow steps 1 & 2.
The issue arises when I want to avoid double-checking the same ancestor vertex_elements. The main query is a traversal downwards, checking each descendant's accessibility with a set of candidate element_associations
WITH RECURSIVE edges_recursive(child_id, parent_id, matching_element_association_id) AS (
(
SELECT e1.child_id, e1.parent_id, ea.id
FROM edges e1
LEFT OUTER JOIN element_associations ea ON e1.child_id = ea.object_id
AND ea.id IN (?)
WHERE parent_id = ?
)
UNION
(
SELECT e2.child_id, e2.parent_id, ea.id
FROM edges e2
INNER JOIN assignments_recursive
ON edges_recursive.child_id = e2.parent_id
LEFT OUTER JOIN element_associations ea
ON edges_recursive.child_id = ea.object_id
AND ea.id IN (?)
WHERE edges_recursive.matching_element_association_id IS NULL
)
)
SELECT edges_recursive.child_id
FROM edges_recursive
WHERE edges_recursive.matching_element_association_id IS NOT NULL
However, there is an addition recursive subquery which checks each vertex_element's within the LEFT OUTER JOIN element_associations which looks like
ea.id IN (
WITH RECURSIVE parent_edges_recursive(child_id, parent_id, matching_element_association_id) AS (
(
SELECT edges.child_id, edges.parent_id, ea.id
FROM edges
LEFT OUTER JOIN element_associations ea
ON ea.id IN (?) AND edges.parent_id = ea.object_id
WHERE edges.child_id = e1.parent_id AND edges.parent_id != e1.parent_id
)
UNION
(
SELECT edges.child_id, edges.parent_id. ea.id
FROM edges
JOIN parent_edges_recursive
ON parent_edges_recursive.parent_id = edges.child_id
LEFT OUTER JOIN element_associations ea
ON ea.id IN (?) AND edges.parent_id = ea.object_id
WHERE parent_edges_recursive.matching_element_association_id IS NULL
)
SELECT parent_edges_recursive.matching_element_association_id
FROM parent_edges_recursive
WHERE parent_edges_recursive.matching_element_association_id IS NOT NULL
LIMIT 1
)
)
The issue with this is that there is that the subqueries tend to avoid traversing the same parent vertex twice; however, there is no guarantee that as we traverse down the graph through the descendants that we won't retread previously evaluated ancestors. For small data sets, this is fine and the performance is okay; however, it's ridiculously unscalable, and is extremely unresilient to cycles.
What I need to do is preserve the information about what parent vertex_elements I have already traversed between subqueries so that I avoid retreading steps; however, I am stuck on how to do this in a singular query.